Abstract

Speech coding is important in the effort to make more efficient use of digital telecom- 
munication networks, particularly wireless systems, and to reduce the memory re- 
quirements in speech storage systems. The desire for a low-rate digital representation 
of speech is often contrary to the demand for a high quality speech reconstruction. 
In this thesis we present a new speech compression technique designed for near toll 
quality speech coding at bit rates as low as 4 kb/s. 

In low-rate speech coding based on linear prediction (LP), poor modelling of the 
LP excitation for voiced, quasi-periodic segments contributes to the degradation of 
the quality of the reconstructed speech. In this dissertation, we present a new speech 
coding method designed for improved modelling of the LP excitation. 

Conceptually, the LP excitation is decomposed into a series of underlying pitch 
pulses and a simultaneous unvoiced noise-like signal. The underlying pitch pulses are 
estimated from noisy observations, i.e. the pitch pulses extracted from the LP resid- 
ual. Since the pulses change little from one time instant to another, we call our 
representation the Pitch Pulse Evolution (PPE) model. The PPE model provides a 
framework to analyze and effectively control the periodicity of voiced speech. 

We have developed a robust algorithm for extracting noisy pitch pulses from the 
LP residual based on error minimization with respect to a set of model pulses, and 
we have examined a number of methods for calculating the underlying pulses. The 
evolving pitch pulse waveshapes, the pulse positions, and the unvoiced signal are 
encoded separately. The positions and the shapes of the underlying pulses need only 
be coded infrequently, and the characteristics of intermediate pulses are obtained by 
interpolation. 

The software implementation of a 4 kb/s PPE coder is described. The main fea- 
tures of the implemented PPE coder are: a novel approach to pitch analysis; estima- 
tion of evolving pitch pulses which enables control over the pulse characteristics; and 
a unique coding scheme which avoids the time dilation and contraction of individual 
pitch pulses found in other waveform interpolation coders.





Acknowledgments 

First of all I want to express my gratitude to my supervisor Prof. Peter Kabal for his 
invaluable guidance throughout the course of this work. He fostered and helped to shape 
the ideas and concepts presented here. His assistance shows not only in the contents of this 
thesis but also in the style of this presentation. 

I would like to thank the Canadian Institute for Communication Research (CIRT) who 
financially supported this project. The research was conducted in the Telecommunications 
and Signal Processing (TSP) laboratory at McGill University and I would like to acknowl- 
edge the use of their very good facilities. 

I am very thankful to my friends and colleagues in the TSP laboratory. I am thinking of 
those who are here now as well as those who have already left. The friendly and supportive 
atmosphere that they provided was just as important as their technical help when I needed 
it. I am obliged to Hossein for proofreading parts of my thesis, to Khaled who contributed 
to that as well, to Florence and Marc who helped me with the French abstract. 

Some of my friends outside the university are particularly special to me. Glenn Seviss,
being my longest roommate, proved to be also my best travel companion. Fran Yadao
even read my thesis, knowing little about engineering and still less about speech
compression, and despite being in Winnipeg at present. Rob Swick was usually there for me
to talk about life in general (a great topic after a full day in front of the screen). I owe
much to many other friends not mentioned here who have been part of my life in Montreal.

My very special thanks go to Agnieszka Roginska who became particularly dear to me. 
She is responsible for some of the best times of my life, and I’m looking forward to more to 
come. 

Finally, I thank my parents for their great love, support and many words of encouragement.
I do appreciate that they could write me more letters in a month than I would in a
year. They may not know it but they have a big part in this thesis being completed.




Contents

1 Introduction
  1.1 Motivation for Speech Coding
  1.2 The Basics of Speech Coding
    1.2.1 Speech Production and Perception
    1.2.2 Quantization
    1.2.3 Pulse Code Modulation
    1.2.4 Attributes of Speech Coders
    1.2.5 Evaluating the Performance
  1.3 State-of-the-Art Coders
    1.3.1 Linear Prediction Analysis-by-Synthesis Coders
    1.3.2 Frequency Domain Coders
    1.3.3 Waveform Interpolation Coders
    1.3.4 Other Coders
  1.4 Objectives and Scope of Our Research
  1.5 Organization of the Thesis

2 Modelling the Excitation in Linear Predictive Coding
  2.1 Voiced and Unvoiced Speech
  2.2 Linear Prediction Analysis
  2.3 Modelling the LP Excitation with Fixed-Length Analysis
    2.3.1 Analysis-by-Synthesis
    2.3.2 Code-Excited Linear Prediction (CELP)
    2.3.3 Generalized Analysis-by-Synthesis
  2.4 Modelling the LP Excitation with Pitch-Synchronous Analysis
    2.4.1 Glottal Coding
    2.4.2 Waveform Interpolation Coding
    2.4.3 The Pitch Pulse Evolution Model
  2.5 Summary

3 The Pitch Pulse Evolution Model
  3.1 The PPE Concept
  3.2 Extraction of the Pitch Pulses
  3.3 Estimation of the Evolving Pitch Pulse
    3.3.1 Linear Filtering
    3.3.2 Maximum Ratio Combining
    3.3.3 Noise Error Minimization
    3.3.4 Total Error Minimization
  3.4 Summary

4 Interpolation of the Pitch Pulses
  4.1 Pitch Pulse-Length Interpolation
    4.1.1 Periodic and Quasi-Periodic Signals
    4.1.2 Pitch Interpolation in Existing Coders
    4.1.3 Is Time Warping Justified?
    4.1.4 Pitch Pulse-Length Interpolation in the PPE Model
  4.2 Pitch Pulse-Shape Interpolation
    4.2.1 Spectral Interpolation
    4.2.2 Spectral Interpolation in the PPE Model
  4.3 Summary

5 Implementation of the 4 kb/s PPE Coder
  5.1 The Coder Structure
  5.2 Linear Prediction Analysis and Coding
  5.3 Pitch Pulse Extraction
    5.3.1 Frame Classification
    5.3.2 Error Calculation
    5.3.3 Segmentation of the LP Residual
    5.3.4 Computational Savings
  5.4 Coding the Pitch Pulse Positions
    5.4.1 Choosing the Pitch Pulse Position to Code
    5.4.2 Pitch Pulse Length Interpolation
  5.5 Coding the Gain
  5.6 Coding the Shape of the Pitch Pulses
  5.7 Coding the Noise Component
  5.8 Testing and Remarks

6 Final Remarks, Contributions and Future Work
  6.1 Summary of Our Work
  6.2 PPE Coding Versus WI Coding
  6.3 Our Contributions
  6.4 Claims of Originality
  6.5 Future Work

A The Pitch Pulse Length Interpolation Algorithm

B Weighted Minimum Square Linear Fit

Bibliography



List of Figures

2.1 General speech production model
2.2 Voiced and unvoiced speech and the corresponding power spectra
2.3 Voiced and unvoiced speech and the LP residual
2.4 Linear prediction analysis-by-synthesis (LPAS) coding
2.5 Code-excited linear prediction (CELP) coder
2.6 Stages of the CELP analysis and synthesis
2.7 The LP residual analysis
3.1 Vector representation of pitch pulses for voiced LP residual
3.2 Vector representation of unvoiced LP residual
3.3 The error between pitch pulses
3.4 The underlying, evolving pitch pulse
3.5 The underlying pitch pulse and the noisy pulses
3.6 Summary of the notation used in the PPE model
3.7 Voiced/unvoiced decomposition of speech
3.8 The LP residual with identified pitch pulses
3.9 Estimation of the underlying pitch pulses (1)
3.10 Estimation of the underlying pitch pulses (2)
3.11 Estimation of the underlying pitch pulses (3)
3.12 Comparison between the SVD and the weighted average estimation
4.1 Time warping versus time shifting
5.1 Block diagram of the PPE encoder
5.2 Block diagram of the PPE decoder



List of Tables

3.1 Comparison between the underlying pitch pulse estimation using the SVD and the weighted average for different values of the error weight
5.1 Bit allocation in the 4 kb/s PPE coder
5.2 The constants used in the pitch extraction algorithm. The values marked with an asterisk are subject to the up-sampling rate F_up, which in the described coder is equal to eight
5.3 Pitch quantizing table used in the start frame
5.4 Pitch quantizing table used in the continue/end frame



List of Acronyms

ACR      Absolute Category Rating
ADPCM    Adaptive Differential Pulse Code Modulation
CELP     Code-Excited Linear Prediction
DPCM     Differential Pulse Code Modulation
GSM      Global System for Mobile Telecommunications
IMBE     Improved Multi-Band Excitation
ITU      International Telecommunication Union
LP       Linear Prediction
LPAS     Linear Prediction Analysis-by-Synthesis
MBE      Multi-Band Excitation
MELP     Mixed Excitation Linear Prediction
MIPS     Million Instructions Per Second
MOS      Mean Opinion Score
PCM      Pulse Code Modulation
PCS      Personal Communication Systems
PPE      Pitch Pulse Evolution
PSELP    Pitch Synchronous Excited Linear Prediction
PWI      Prototype Waveform Interpolation
RPE      Regular-Pulse Excitation
RPE-LTP  Regular-Pulse Excitation with Long-Term Prediction
SVD      Singular Value Decomposition
QCELP    Qualcomm CELP
STC      Sinusoidal Transform Coding
TFI      Time-Frequency Interpolation
VSELP    Vector Sum Excited Linear Prediction
VQ       Vector Quantization
WI       Waveform Interpolation



Chapter 1 
Introduction 


1.1 Motivation for Speech Coding 

Speech communication is arguably the single most important interface between hu- 
mans, and it is now becoming an increasingly important interface between human and 
machine. As such, speech represents a central component of digital communication 
and constitutes a major driver of telecommunications technology. 

With the increasing demand for telecommunication services (e.g., long distance,
digital cellular, mobile satellite, aeronautical services), speech coding has become a
fundamental element of digital communication. Emerging applications in rapidly
developing digital telecommunication networks require low bit rate, reliable, high quality
speech coders. The need to save bandwidth in both wireless and wireline networks
and the need to conserve memory in voice storage systems are two of the many reasons
for the very high activity in speech coding research and development. New commercial
applications of low-rate speech coders include wireless personal communication
systems (PCS) and voice-related computer applications (e.g., message storage, speech
and audio over the Internet, interactive multimedia terminals).

In recent years, speech coding has been facilitated by rapid advancement in digital 
signal processing and in the capabilities of digital signal processors. A strong incentive 
for research in speech coding is provided by a shift of the relative costs involved 
in handling voice communication in telecommunication systems. On the one hand, 
there is an increased demand for larger capacity of the telecommunication networks. 
On the other, the rapid advancement in the efficiency of digital signal processors
and digital signal processing techniques has stimulated the development of speech
coding algorithms. These trends are likely to continue, and speech compression will most
certainly remain an area of central importance as a key element in reducing the
cost of operation of voice communication systems.

1.2 The Basics of Speech Coding 

1.2.1 Speech Production and Perception 

In speech coding, the bit-rate reduction is achieved by removing the inherent infor- 
mation redundancies present in the speech waveform. The understanding of the basic 
properties of the speech signal and its perception is crucial to the design of a speech 
coder which would, ideally, parameterize only perceptually relevant information and 
thus compactly represent the signal. 

When speech is produced, an airflow forced from the lungs passes through the 
larynx into the vocal tract. In the larynx, the elastic vocal folds can partially or 
completely obstruct the airflow creating a vocal tract excitation of turbulent noise 
or puffs of air. The opening between the vocal folds is called the glottis, and the air 
emanating from the vocal folds is often called the glottal excitation. 

The speech signal can be roughly divided into voiced and unvoiced segments. Dur- 
ing voiced speech the glottis periodically opens and closes and the glottal excitation 
has a periodic character. The excitation waveform corresponding to one cycle of glot- 
tal opening and closure is referred to as a glottal pulse, or pitch pulse. Consecutive 
pitch pulses may vary in their lengths and waveform shapes and the resulting glottal 
excitation is quasi-periodic. 

For unvoiced speech, the glottal excitation is formed as the air forced through the 
constriction of the glottis creates a turbulence. The glottis does not open and close 
periodically but only contracts causing perturbations in the airflow. The unvoiced 
excitation does not display any apparent periodicity and has a noisy character. 

The time properties of the speech production are reflected in the spectral features 
of the speech signal. The spectrum of the voiced excitation has a harmonic structure 
(i.e., sharp amplitude peaks at regular frequency intervals) with the fundamental 
frequency corresponding to the rate of the glottis closures. The spectrum of the 
unvoiced excitation has no prominent harmonics and it resembles the spectrum of a
white noise signal. The glottal excitation has no distinctive spectral envelope except
for a spectral tilt during voiced speech. The spectral envelope, i.e., the broad peaks and
valleys of the spectrum, is imposed on the glottal excitation by the vocal tract. For
both voiced and unvoiced excitation, the vocal tract acts as a filter shaping the
frequency response of the speech signal.

The non-flat frequency response of the vocal tract introduces correlation between 
adjacent samples of the speech signal (short-term correlations). During voiced speech 
the periodic character of the excitation results in the correlation between the cor- 
responding samples of adjacent pitch pulses (long-term correlation). In the spectral 
domain, the short-term correlation corresponds to the spectral envelope and the long- 
term correlation is reflected in the spectral fine structure. Both correlations introduce 
information redundancies in the speech signal and can be exploited in speech coding. 
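
The two kinds of correlation can be illustrated with a normalized autocorrelation. The following sketch is an illustration only and is not part of the coder described in this thesis. It assumes a synthetic stand-in for voiced speech: an impulse train with a period of 50 samples (a hypothetical pitch period) shaped by a one-pole filter that crudely plays the role of the vocal tract. The function names and parameter values are chosen for the example.

```python
def normalized_autocorr(x, lag):
    """Normalized autocorrelation of the sequence x at a non-negative lag."""
    energy = sum(v * v for v in x)
    if energy == 0.0:
        return 0.0
    return sum(x[n] * x[n - lag] for n in range(lag, len(x))) / energy


def synthetic_voiced(num_samples=400, period=50, pole=0.9):
    """Impulse train (assumed pitch period) shaped by a one-pole filter.

    The impulse train stands in for the glottal excitation and the
    one-pole filter for the vocal tract; both are gross simplifications.
    """
    out, prev = [], 0.0
    for n in range(num_samples):
        excitation = 1.0 if n % period == 0 else 0.0
        prev = excitation + pole * prev
        out.append(prev)
    return out


signal = synthetic_voiced()
short_term = normalized_autocorr(signal, 1)   # adjacent samples
long_term = normalized_autocorr(signal, 50)   # one pitch period apart
off_pitch = normalized_autocorr(signal, 25)   # half a pitch period apart
```

For this synthetic signal the correlation is high at lag 1 and at the pitch lag, and much lower at a lag that does not match the period, which is the redundancy that short-term and long-term prediction exploit.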

It is not known exactly what analysis is performed by the human hearing system.
One frequently exploited property of the auditory system is the spectral masking
phenomenon. Spectral masking makes inaccuracies of the signal representation
that occur in and near high-energy frequency bands less audible than inaccuracies
that occur in other frequency regions. In the time domain, the human ear
has a larger tolerance to the errors resulting from an inaccurate representation of
high-energy samples than to the representation errors which coincide with low-energy
samples. It is clear that both temporal and spectral characteristics of the speech signal
are important and this is increasingly reflected in modern coders. In fact, coders
which combine time domain and frequency domain analysis are strong contenders in
the area of very low-rate speech coding.

1.2.2 Quantization 

Quantization is an integral part of every speech coder. Most parameters and every
waveform used to represent the speech signal must be quantized before they are
encoded. A quantized value and the corresponding coded quantity may be equivalent.
More often, however, the coded value is an index to a quantized parameter or
waveform selected from a set of permissible quantization outcomes. In the process
of quantization a numerical value, or a vector of values, is represented with reduced
precision. The difference between the original value (vector) and its quantized version
is the quantization noise.

If a single value is quantized at a time we deal with scalar quantization. The
value is represented by one of several fixed discrete values called quantization levels. In 
uniform scalar quantization the quantization levels are equally spaced. In logarithmic 
quantization the spacing is uniform on a logarithmic scale. 
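
As a minimal sketch of the two scalar quantizers just described, the code below builds a set of equally spaced levels and a set of levels spaced uniformly on a logarithmic scale, and maps a value to the nearest level. The number of levels, the amplitude range, and the function names are assumptions made for the illustration; they are not taken from any particular coder.

```python
def uniform_levels(num_levels, x_max):
    """Equally spaced quantization levels covering [-x_max, x_max]."""
    step = 2.0 * x_max / num_levels
    return [-x_max + (i + 0.5) * step for i in range(num_levels)]


def log_levels(num_levels_per_sign, x_min, x_max):
    """Levels spaced uniformly on a logarithmic scale, mirrored in sign."""
    ratio = (x_max / x_min) ** (1.0 / (num_levels_per_sign - 1))
    positive = [x_min * ratio ** i for i in range(num_levels_per_sign)]
    return [-v for v in reversed(positive)] + positive


def quantize(value, levels):
    """Return (index, level) of the quantization level nearest to value."""
    index = min(range(len(levels)), key=lambda i: abs(levels[i] - value))
    return index, levels[index]


levels_u = uniform_levels(8, 1.0)        # 8 levels, i.e. 3 bits per value
levels_l = log_levels(4, 0.01, 1.0)      # 8 levels, constant ratio between neighbours
index, q = quantize(0.3, levels_u)
noise = 0.3 - q                          # quantization noise for this value
```

Only the index needs to be encoded; the decoder holds the same table of levels and looks the value up.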

If a vector is represented with a fixed number of possible outcomes we perform
vector quantization (VQ). The collection of the possible representations of a vector
is referred to as a codebook. Often more than one codebook is used to represent a
vector. A large number of procedures have been proposed to create, organize,
and search the codebooks. Such methods include tree-structured VQ, transform VQ,
product code VQ, split VQ, gain-shape VQ, multistage VQ, and hierarchical VQ (Gersho
and Gray 1992). The method employed depends on the properties of the vector to
be quantized and the desired criteria the quantized representation should satisfy. For
example, in quantizing the linear prediction parameters used to represent the vocal
tract characteristics in linear prediction coding, split VQ is often used. In modelling
the linear prediction filter excitation, gain-shape VQ and multistage VQ are employed.
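
The basic operation shared by these methods is the search for the codebook entry closest to the input vector. The sketch below shows an unstructured full search under a squared-error criterion; the tiny two-dimensional codebook is hypothetical and serves only to make the example concrete. Structured schemes such as split or multistage VQ organize the codebook differently to reduce storage and search cost.

```python
def squared_error(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def vq_encode(vector, codebook):
    """Full search: index of the codevector closest to the input vector."""
    return min(range(len(codebook)),
               key=lambda i: squared_error(vector, codebook[i]))


def vq_decode(index, codebook):
    """The decoder simply looks the codevector up by its index."""
    return codebook[index]


# Hypothetical codebook with 4 entries: 2 bits encode a pair of values.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
index = vq_encode((0.9, 0.2), codebook)
reconstruction = vq_decode(index, codebook)
```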

Although scalar quantization is still used for quantizing some of the parameters, 
it is the application of vector quantization which enables significant reduction of the 
number of bits required to efficiently represent the speech signal. 

1.2.3 Pulse Code Modulation 

In pulse code modulation (PCM) coding the speech signal is represented as a series
of quantized values which correspond to the amplitudes of the speech samples. In
uniform 128 kb/s PCM, for example, narrow-band speech (200-3400 Hz) is sampled
at 8 kHz and represented with 16 bits per sample by means of uniform scalar
quantization. In μ-law and A-law log-PCM, the samples are logarithmically quantized
with 8 bits, which results in a bit rate of 64 kb/s. Eight-bit log-PCM coders are
widely used in network telephony.
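
The bit rates quoted above follow directly from the sampling rate and the number of bits per sample. The sketch below reproduces that arithmetic and illustrates logarithmic quantization with the continuous μ-law characteristic (μ = 255). It is a simplified illustration: practical log-PCM codecs use a piecewise-linear approximation of this curve, and the function names here are chosen for the example.

```python
import math

MU = 255.0  # compression parameter of the mu-law characteristic


def bit_rate_kbps(sampling_rate_hz, bits_per_sample):
    """Bit rate of a PCM stream in kb/s."""
    return sampling_rate_hz * bits_per_sample / 1000.0


def mu_law_compress(x):
    """Continuous mu-law characteristic for x in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_law_expand(y):
    """Inverse of mu_law_compress for y in [-1, 1]."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def log_pcm_quantize(x, bits=8):
    """Compress, quantize uniformly with 2**bits levels, then expand."""
    levels = 2 ** bits
    step = 2.0 / levels
    y = mu_law_compress(x)
    index = min(levels - 1, int((y + 1.0) / step))
    y_hat = -1.0 + (index + 0.5) * step
    return index, mu_law_expand(y_hat)


uniform_rate = bit_rate_kbps(8000, 16)   # uniform PCM
log_rate = bit_rate_kbps(8000, 8)        # log-PCM
_, quiet_hat = log_pcm_quantize(0.01)
_, loud_hat = log_pcm_quantize(0.9)
quiet_error = abs(0.01 - quiet_hat)
loud_error = abs(0.9 - loud_hat)
```

In this example the absolute quantization error is larger for the high-amplitude sample than for the low-amplitude one, which is the intended behaviour of a logarithmic quantizer.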

Uniform PCM does not exploit any specific properties of speech and is valid for 
any band-limited signal. Log-PCM takes advantage of the nonuniform distribution of 
speech amplitudes and the fact that the louder the sound the less sensitive the human 
ear becomes to small changes in the intensity of the sound. The latter property allows 
the increase of the quantization noise in the regions of high energy without significant 
loss of the reconstructed speech quality. PCM coding does not take advantage of the
existing correlations between speech samples.

The correlation between adjacent samples is exploited in differential PCM (DPCM).
In DPCM the difference between the current sample and its predicted value is quantized
and transmitted. In a simple linear prediction the current sample is predicted
from a number of past, reconstructed samples. In adaptive DPCM (ADPCM), the
linear predictor and/or the quantization levels are varied based on the characteristics
of the past reconstructed speech signal. The same predictor/quantizer modifications
are performed by the encoder and the decoder. If the modifications are based on the
reconstructed speech samples, the information about the modifications need not be
encoded.
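
A minimal DPCM sketch is given below. It assumes a fixed first-order predictor and a uniform quantizer with an unbounded index range; the predictor coefficient, the step size, and the test signal are illustrative values, not those of any standardized coder. The point of the example is that the encoder predicts from the reconstructed signal, so that encoder and decoder remain synchronized and the reconstruction error equals the quantization error of the difference signal.

```python
import math


def quantize_uniform(value, step):
    """Uniform quantizer: returns the integer index of the nearest level."""
    return int(round(value / step))


def dpcm_encode(samples, predictor_coeff=0.9, step=0.05):
    """Quantize the difference between each sample and its prediction."""
    indices, prev_reconstructed = [], 0.0
    for s in samples:
        prediction = predictor_coeff * prev_reconstructed
        index = quantize_uniform(s - prediction, step)
        indices.append(index)
        # Track the decoder's reconstruction so both sides stay in sync.
        prev_reconstructed = prediction + index * step
    return indices


def dpcm_decode(indices, predictor_coeff=0.9, step=0.05):
    """Rebuild the signal from the quantized prediction differences."""
    output, prev_reconstructed = [], 0.0
    for index in indices:
        prev_reconstructed = predictor_coeff * prev_reconstructed + index * step
        output.append(prev_reconstructed)
    return output


samples = [math.sin(0.1 * n) for n in range(100)]
indices = dpcm_encode(samples)
decoded = dpcm_decode(indices)
max_error = max(abs(a - b) for a, b in zip(samples, decoded))
```

An adaptive scheme would additionally update `predictor_coeff` or `step` from the past reconstructed samples, using the same rule in the encoder and the decoder.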

The speech quality achievable with 64 kb/s log-PCM coding and 32 kb/s ADPCM 
is referred to as “toll” quality. The toll quality rating constitutes the reference point 
with respect to which the performance of lower bit rate coders is often compared. 

1.2.4 Attributes of Speech Coders 

The main attributes of a speech coder include: (i) the bandwidth of the speech signal
for which the coder is intended, (ii) the bit rate of the compressed signal, (iii) the
reconstructed speech quality, (iv) the complexity and delay of the coder, (v) the sensitivity
of the coder to background acoustical noise, and (vi) the sensitivity of the encoded bits to
transmission channel errors. Different applications require coders optimized for different
features. In message transmission systems, for example, low delay of the coder may
not be an issue, and central storage systems may not require a low-complexity
implementation of the coder. While in a large number of applications the primary goal is
to ensure the perceived similarity between the original and the reconstructed signal,
in some cases (e.g., in systems in which security is the main concern) it is sufficient
that the reconstructed speech sounds intelligible and natural. In general, the central
trade-off in speech coding is between the bit rate of the compressed signal and the
perceptual quality of the reconstructed speech. In most commercial applications a
real-time implementation of the coder is required. A real-time implementation imposes
constraints on both the complexity and the delay of the coder.

1.2.5 Evaluating the Performance

One of the major difficulties in designing and testing various speech coders is the 
lack of an objective quality measure to represent the perception-based goals in the 
form of an error function between the original and the reconstructed signal. The 
most commonly used objective criteria (signal-to-noise ratio, segmental signal-to- 
noise ratio, log spectral distance) are sensitive to gain variations and delays between 
the original and coded speech. They also usually do not fully account for perceptual 
properties of the hearing system. A number of objective methods based on human 
auditory perception models have been proposed (Schroeder et al. 1979, Wang et al. 
1992, Paillard et al. 1992, Jayant et al. 1993, De 1993), but none has yet eliminated 
the necessity of subjective testing. 
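
The sensitivity of the waveform-based criteria to gain and delay can be seen in a small numerical sketch. The code below computes the signal-to-noise ratio and the segmental signal-to-noise ratio for a test tone whose "coded" version is merely attenuated or delayed by one sample. The frame length of 160 samples (20 ms at 8 kHz) and the 200 Hz test tone are assumptions made for the example, and the functions assume non-zero signal and noise energy in every frame.

```python
import math


def snr_db(original, coded):
    """Signal-to-noise ratio in dB (assumes non-zero signal and noise energy)."""
    signal = sum(x * x for x in original)
    noise = sum((x - y) ** 2 for x, y in zip(original, coded))
    return 10.0 * math.log10(signal / noise)


def segmental_snr_db(original, coded, frame_len=160):
    """Average of the per-frame SNR values (in dB) over complete frames."""
    values = []
    for start in range(0, len(original) - frame_len + 1, frame_len):
        values.append(snr_db(original[start:start + frame_len],
                             coded[start:start + frame_len]))
    return sum(values) / len(values)


# 100 ms of a 200 Hz tone sampled at 8 kHz.
tone = [math.sin(2.0 * math.pi * 200.0 * n / 8000.0) for n in range(800)]
attenuated = [0.5 * x for x in tone]     # same waveform, half the gain
delayed = [0.0] + tone[:-1]              # same waveform, one sample late

snr_attenuated = snr_db(tone, attenuated)
segsnr_attenuated = segmental_snr_db(tone, attenuated)
snr_delayed = snr_db(tone, delayed)
```

Both modifications leave the waveform shape intact, yet the attenuated copy scores only about 6 dB and the delayed copy well under 20 dB, which illustrates why these measures correlate poorly with perceived quality.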

The most commonly performed subjective tests are absolute category rating (ACR)
tests, of which one example is the Mean Opinion Score (MOS) test (described, for
example, by Kroon 1995). In the MOS test a number of listeners are asked to evaluate
the quality of recorded speech according to a five-level scale. For narrow-band speech,
a score of 4-4.5 implies toll quality and a score between 3.5 and 4 indicates
communications quality. Scores below 3.5 mean that the reconstructed speech is of poor
quality; synthetic speech often scores in the range 2.5-3.5. The MOS scores can differ
significantly from one test to another, often due to cultural and/or linguistic biases,
and therefore do not provide an absolute comparison between coders.

Subjective testing in general is time consuming and therefore expensive. Many
proposed coders have not been subjected to rigorous testing and the reported results are
difficult to calibrate.

1.3 State-of-the-Art Coders 

In the current state-of-the-art coders a noticeable coding noise appears at bit rates
below 8 kb/s. The coded speech is natural and intelligible, the speaker is easily identified
and his/her intonation is preserved, but the distortion is noticeable even though not
annoying. This corresponds to MOS values above 3.5 and below 4.0. The naturalness
is slightly lost at rates of 2-4 kb/s. The coded speech also has an increasingly noisy, “hoarse”
quality with a varying degree of buzziness. Such coders usually obtain MOS values of
3.0-3.5. At rates below 1 kb/s the speaker identity and naturalness are mostly lost.

Techniques that have been especially successful in achieving high quality speech
at low bit rates include linear prediction (LP) coding and sinusoidal coding. Linear 
prediction coders operate mainly in the time domain while sinusoidal coders perform 
most of their analysis in the frequency domain. Some of the recent work (for example 
Waveform Interpolation) can be viewed as an attempt to combine time domain and 
frequency domain analysis. 

1.3.1 Linear Prediction Analysis-by-Synthesis Coders 

A particularly successful group of the LP coders is comprised of coders which use 
analysis-by-synthesis techniques (Kroon and Deprettere 1988, Kroon and Kleijn 1995, 
Cucchi et al. 1996). In linear prediction analysis-by-synthesis (LPAS) coding, the re- 
produced speech is synthesized by filtering an excitation signal with a time-varying 
linear filter. The coefficients of the synthesis filter are determined by linear predic- 
tion analysis of the speech signal. The excitation is determined by filtering excitation 
candidates with the synthesis filter and selecting the one which minimizes a percep- 
tually weighted distortion measure between the reconstructed and the original signal. 
LPAS coders include multi-pulse LP (introduced by Atal and Remde 1982), Regular- 
Pulse Excitation (RPE) (introduced by Kroon et al. 1986) and, most studied to date, 
Code-Excited Linear Prediction (CELP) (introduced by Atal and Schroeder 1984, 
Schroeder and Atal 1985). The initially large computational complexity of CELP
was significantly reduced through subsequent improvements (Davidson and Gersho
1986, Trancoso and Atal 1990, Kleijn and Krasinski 1990, Gerson and Jasiuk
1991a, Elshafei-Ahmed and Al-Suwaiyel 1993, Moreau and Dymarski 1994), and over
the years CELP became the most widely used speech coding technique.
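
The analysis-by-synthesis principle described above can be sketched as a loop over excitation candidates. The code below is a deliberately simplified illustration and not the search used in any of the cited coders: it assumes a fixed all-pole synthesis filter with zero initial state, a tiny hypothetical codebook, an optimal gain for each candidate, and a plain squared-error criterion in place of the perceptually weighted distortion measure.

```python
def synthesize(excitation, lp_coeffs):
    """All-pole synthesis: s[n] = e[n] + sum_k a[k] * s[n-1-k], zero initial state."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, a in enumerate(lp_coeffs):
            if n - 1 - k >= 0:
                acc += a * out[n - 1 - k]
        out.append(acc)
    return out


def search_codebook(target, codebook, lp_coeffs):
    """Return (index, gain, error) of the candidate giving the smallest error."""
    best = None
    for index, candidate in enumerate(codebook):
        synth = synthesize(candidate, lp_coeffs)
        energy = sum(v * v for v in synth)
        if energy == 0.0:
            continue
        corr = sum(t * v for t, v in zip(target, synth))
        gain = corr / energy  # gain minimizing the squared error for this candidate
        error = sum((t - gain * v) ** 2 for t, v in zip(target, synth))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best


lp_coeffs = [0.8]  # hypothetical first-order predictor coefficient
codebook = [
    [1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, -1, 0, 0, 0],
    [1, -1, 1, -1, 1, -1, 1, -1],
]
# Build a target that candidate 1 reproduces exactly with a gain of 2.
target = [2.0 * v for v in synthesize(codebook[1], lp_coeffs)]
best_index, best_gain, best_error = search_codebook(target, codebook, lp_coeffs)
```

Each candidate is passed through the synthesis filter before it is compared with the target, which is why the cost of the search grows with the codebook size and why the complexity reductions cited above were important.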

The success of the LPAS technique is reflected in the fact that many low-rate 
speech coding standards adopted in the last few years are LPAS coders. In Eu- 
rope the Regular-Pulse Excitation with Long-Term Prediction (RPE-LTP) coder at 
13 kb/s (MOS ~ 3.6) was chosen as a standard for the GSM (Global System for 
Mobile Telecommunications) digital cellular telephony. The Vector Sum Excited 
LP (VSELP) coder (Gerson and Jasiuk 1991b) operating at 5.6 kb/s (MOS ~ 3.5) 
was selected as the corresponding half-rate standard. In the North American digital
cellular telephony VSELP operating at 7.95 kb/s (MOS ~ 3.5) and Qualcomm
CELP (QCELP) (DeJaco et al. 1993) at 8.5 kb/s (MOS ~ 3.4) were chosen as interim
standards. The U.S. government has adopted a 4.8 kb/s CELP coder (MOS ~ 3.2) as
the secure voice communication standard (Campbell et al. 1989). The International 
Telecommunication Union (ITU) has very recently adopted a new standard for 8 kb/s 
toll quality coding — G.729; the standard is a CELP-based coder with MOS ~ 4.0 
(Salami et al. 1994, 1995). 

1.3.2 Frequency Domain Coders 

Although the LPAS-based coders produce very high quality speech in the range of
4-16 kb/s, their performance degrades rapidly around 4 kb/s (Atal and Caspers
1991, Tzeng 1991), at which point the performance of time domain waveform matching
(even with a carefully chosen perceptually weighted error criterion) deteriorates.
A viable alternative to LPAS coders, particularly in the range 2-4 kb/s, is comprised
of coders which directly use frequency representations in their analysis. The
most prominent frequency domain techniques for low-rate coding are: harmonic coding
(Almeida and Tribolet 1982, Marques et al. 1990), Sinusoidal Transform Coding
(STC) (McAulay and Quatieri 1986, McAulay et al. 1991, McAulay and Quatieri
1995), and coding based on Multi-Band Excitation (MBE) (Hardwick and Lim 1988,
1989, Brandstein et al. 1990). The three coding methods are sometimes grouped
under the common name of sinusoidal coders.

In sinusoidal coding, the spectral peaks of a short-time Fourier transform are
identified and the speech signal is reconstructed by interpolation of the amplitudes,
the phases and the frequencies of a set of sine waves. Although the amount of work
on sinusoidal coders has been small compared to CELP, there are many indications
that this is a promising approach for the future (Gersho 1994). For example, an
Improved Multi-Band Excitation (IMBE) coder (Brandstein et al. 1990) operating
at 4.15 kb/s (MOS ~ 3.3) was selected by Inmarsat as a standard for satellite voice
communications.

An insightful comparison between CELP and the sinusoidal coding is offered by 
Trancoso et al. (1990). The authors argue that the two techniques are complementary 
and might well be merged in future systems. 
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1.3.3 Waveform Interpolation Coders 

Waveform Interpolation (WI) is an attempt to combine aspects of the time domain 
and the frequency domain analysis. Prototype Waveform Interpolation (PWI) (Kleijn 

1991, Kleijn and Granzow 1991) and Time-Frequency Interpolation (TFI) (Shoham 

1992, 1993b) techniques are precursors to the recent WI (Kleijn and Haagen 1994b, 
1995b). A WI coder implemented at 2.4 kb/s demonstrated very high quality of 
synthesized speech with MOS ~ 3.5 (Kleijn and Haagen 1995a, Kleijn et al. 1996). 
Over the last couple of years many techniques have been suggested for use within 
the WI framework (Tanaka and Kimura 1994, Burnett and Bradley 1995, Jiang and
Cuperman 1995, Festa and Sereno 1995, Tang and Cheetham 1995). Although inter- 
polation of the prototype waveforms is usually performed in the frequency domain, 
interpolation in the time domain has also been implemented with good results (Yang 
et al. 1995). 

Similarities and differences between WI and STC are examined in (Sen and Kleijn 
1995) and (Kleijn and Haagen 1995b). 

1.3.4 Other Coders 

A number of other coders have been implemented at bit rates below 4 kb/s with good 
results. The five coders evaluated in the second stage of the competition for the U.S. 
government standard for 2.4 kb/s secure voice communication were: an IMBE coder, 
a STC coder, a WI coder, a Pitch Synchronous Excited Linear Prediction (PSELP) 
coder (Fette et al. 1993), and a Mixed Excitation Linear Prediction (MELP) coder 
(McCree and Barnwell III 1993, 1995). The last two coders, the PSELP coder and 
the MELP coder, use linear prediction but they do not choose the LP excitation 
based on analysis-by-synthesis; both coders perform part of their analysis in the 
frequency domain. The MELP coder was selected as the winning candidate for the 
aforementioned U.S. federal standard (McCree et al. 1996).

1.4 Objectives and Scope of Our Research 

In this thesis we are concerned with telephone quality speech band-limited from 200 Hz 
to 3.4 kHz. The analog signal is sampled at 8 kHz and represented with 16-bit uniform
PCM, resulting in a digital signal with a bit rate of 128 kb/s. Our goal is to
represent this digital signal with a bit stream of about 4 kb/s with the reconstructed
speech signal very close or equivalent to toll quality. 
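
As a quick check of the figures quoted above (a worked calculation added for illustration, using only the sampling rate, word length, and target rate stated in the text), the input bit rate and the implied compression ratio are:

```latex
\[
R_{\mathrm{PCM}} = 8000\ \tfrac{\text{samples}}{\text{s}} \times 16\ \tfrac{\text{bits}}{\text{sample}} = 128\ \text{kb/s},
\qquad
\frac{R_{\mathrm{PCM}}}{R_{\mathrm{target}}} = \frac{128\ \text{kb/s}}{4\ \text{kb/s}} = 32 .
\]
```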

We propose a new speech coding method based on our Pitch Pulse Evolution (PPE) 
model. We introduce and describe the PPE model in the context of existing coding 
systems and we present an implementation of a 4 kb/s PPE coder. 

Our coding system is constrained to have moderate complexity and algorithmic 
delay†. A moderate complexity coder is a coder which is implementable on a single
fixed-point 16-bit DSP chip which can perform about 40 million instructions per sec- 
ond (MIPS). Moderate algorithmic delay is understood to be about 50-60 ms, which 
includes the processed speech block and the look-ahead. Such moderate complexity 
and delay is a requirement for real-time operation in the context of applications used 
for conversational speech. 

In this work we have concentrated on achieving high quality reconstructed speech. 
We have not been directly concerned with the sensitivity of the coder to background 
acoustic noise or with the sensitivity of the encoded bits to transmission errors. How- 
ever, the final configuration has elements of similarity to existing coders and as such 
we do not expect the PPE coder to be unduly sensitive to these factors. 

1.5 Organization of the Thesis 

The organization of this thesis is as follows. In Chapter 2 we review the principles 
of linear prediction analysis-by-synthesis (LPAS) coding in more detail and discuss 
ways of representing the LP excitation. In LP coding poor representation of the 
excitation for voiced segments is to a large extent responsible for the degradation of 
speech quality with decreased bit rate. Various techniques for improving the quality 
of voiced speech are examined. 

In Chapter 3 a general formulation of the Pitch Pulse Evolution (PPE) model is
presented and demanding requirements are imposed on the pitch pulse extraction 
algorithm. The problem of estimating the evolving pitch pulses is discussed and 
several methods of the estimation are investigated. 

The focus of Chapter 4 is on pitch interpolation. We compare the pitch interpolation
used in Waveform Interpolation (WI), Sinusoidal Transform Coding (STC),
and Relaxed-CELP (RCELP). The pitch pulse length and the pitch pulse waveshape
interpolations used in the PPE model are described.

† Algorithmic delay is the sum of (i) the length of the currently processed block of
speech and (ii) the length of the look-ahead needed to process the samples of the
current block.

Chapter 5 presents the implementation of a PPE coder with emphasis on the 
components which are unique to our coder. Among others, a practical and robust 
method for extracting individual pitch pulses from the LP residual is developed and 
the strategy for encoding the pitch information is specified. We discuss the results 
of informal comparison tests of quality of the PPE coded speech with respect to the 
original signal and with respect to the speech coded with G.729. 

Our work is summarized and the future research directions are outlined in Chap- 
ter 6. This chapter also states the contributions of this thesis and lists the claims of 
originality in our work. 



Chapter 2 

Modelling the Excitation in Linear 
Predictive Coding 


In a general speech production model air flows from the lungs to the larynx where it 
is forced through a variable opening between vocal folds (vocal cords). The opening 
between the vocal folds is called the glottis and the airflow which emanates from the 
folds is often referred to as the glottal excitation. The excitation passes through the 
vocal tract which can be modelled as an acoustic tube. The speech signal is created 
as the air exits the vocal tract causing a waveform of air pressure variations. 

The characteristics of the vocal tract are determined by the shape of the passage 
through which the glottal excitation flows. The air-flow passage is shaped by: the oral
and nasal cavities, the tongue, the teeth, the lips, and a number of other articulators 
(see for example O’Shaughnessy 1987). The shape of the air passage influences the 
transfer function of the vocal tract. The effects of the glottal excitation and the vocal 
tract are considered to be independent (Rabiner and Schafer 1978, O’Shaughnessy 
1987, Deller Jr. et al. 1993), which is justified by the fact that the interaction be- 
tween the vocal tract shape and the lung pressure is negligible with respect to other 
simplifications of the model. 

2.1 Voiced and Unvoiced Speech 

The speech signal can be roughly divided into voiced and unvoiced segments. For 
voiced speech the excitation is generated by a periodic opening and closing of the 
glottis resulting in a series of similar pitch pulses. During unvoiced speech the glottal
excitation is noise with a flat power spectrum, which is often modelled by a random noise
generator. A general, simplified speech production model is presented in Fig. 2.1.
The relative ratio of the unvoiced and voiced components is controlled in the model 
by adjusting the corresponding gains. For a “purely voiced” signal the noise source 
gain is zero, and for a “purely unvoiced” signal the voice source gain is set to zero. 


Fig. 2.1 General speech production model. (Diagram blocks: Voice Source with
its gain; Noise Source with its gain.)
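
The two-source model can be illustrated with a minimal Python sketch. It is not taken from the thesis: the function names, the use of an ideal impulse train as the voice source, and the first-order all-pole filter standing in for the vocal tract are all assumptions made for illustration.

```python
import random


def synthesize_excitation(n_samples, pitch_period, voice_gain, noise_gain, seed=0):
    """Mix an impulse train (voice source) with Gaussian noise (noise source).

    Both sources and their gains are illustrative stand-ins for the blocks
    of the general speech production model.
    """
    rng = random.Random(seed)
    excitation = []
    for n in range(n_samples):
        pulse = 1.0 if n % pitch_period == 0 else 0.0
        noise = rng.gauss(0.0, 1.0)
        excitation.append(voice_gain * pulse + noise_gain * noise)
    return excitation


def all_pole_filter(excitation, a):
    """Hypothetical vocal tract filter: y[n] = x[n] + sum_i a[i-1] * y[n-i]."""
    out = []
    for n, x in enumerate(excitation):
        y = x
        for i, ai in enumerate(a, start=1):
            if n - i >= 0:
                y += ai * out[n - i]
        out.append(y)
    return out


# "Purely voiced" signal: the noise source gain is zero.
voiced = synthesize_excitation(160, pitch_period=40, voice_gain=1.0, noise_gain=0.0)
speech = all_pole_filter(voiced, a=[0.9])
```

Setting `voice_gain` to zero instead gives the "purely unvoiced" case, and intermediate gain values give mixed excitation.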

A short-time power spectrum of speech calculated with a smooth time window 
of 30 ms displays the basic characteristics of the speech signal (Fig. 2.2). We can 
identify the fine spectral structure due to the glottal excitation and the spectrum 
envelope imposed by the vocal tract. During voiced speech, one can observe in the 
fine structure regularly spaced harmonics which are the result of periodic oscillations 
of the vocal folds. During unvoiced speech, the fine structure does not display any 
apparent harmonic makeup. The unvoiced excitation is noise-like (with no periodicity 
evident). The broad peaks of the spectral envelope correspond to resonances of the 
acoustic tube of the vocal tract. The resonances are called formants and the vocal 
tract is said to impose a formant structure on the glottal excitation. 

The fine structure of the spectrum is related to long-term correlation of the samples
of the signal in the time domain. During voiced speech, the harmonic spectral
structure implies a similarity of sequential cycles of the pitch period. For unvoiced
speech, the long-term sample correlation is very small or nonexistent. The spectral
envelope corresponds to short-term correlations between nearby samples. Both the
long-term and the short-term correlations are important and they are exploited in
speech coders.

Fig. 2.2 (a) Unvoiced and voiced segments of speech signal. (b) Power
spectra calculated in the unvoiced and the voiced regions respectively
(frequency axis from 0 to 4 kHz). The power spectra were calculated over
segments 30 ms long smoothed with a Hamming window.

In linear predictive (LP) coding a linear filter of the form 


\[ \frac{1}{A(z)} = \frac{1}{1 - \sum_{i=1}^{N} a_i z^{-i}} \tag{2.1} \]


models the short term correlation in the speech signal (spectral envelope) introduced 
by the vocal tract. The LP filter coefficients $a_1, \ldots, a_N$ are estimated and transformed
into a set of parameters judged to have better coding properties and error robustness. 
The LP parameters are then coded and transmitted. 

The glottal excitation is modelled based on the LP residual and/or the error 
between the original and the reconstructed speech. The LP residual is formed by
filtering the speech signal with the time-varying LP analysis filter

\[ A(z) = 1 - \sum_{i=1}^{N} a_i z^{-i}. \tag{2.2} \]

Voiced and unvoiced speech and the corresponding LP residual are presented 
in Fig. 2.3. The LP residual roughly corresponds to the glottal excitation which 
emanates from the larynx. One can observe randomness of the LP residual within 
the unvoiced region and well defined energy peaks within the voiced region. The 
peaks correspond to the pitch pulses present in the voiced excitation. 
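
The relationship between the speech signal and the LP residual can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes a fixed (not time-varying) coefficient set, zero initial filter state, and hypothetical function names.

```python
def lp_residual(speech, a):
    """Filter speech with A(z) = 1 - sum_{i=1}^{N} a_i z^{-i}.

    `a` holds a_1..a_N; samples before the start of the signal are taken as zero.
    """
    residual = []
    for n in range(len(speech)):
        prediction = sum(a[i - 1] * speech[n - i]
                         for i in range(1, len(a) + 1) if n - i >= 0)
        residual.append(speech[n] - prediction)
    return residual


def lp_synthesis(residual, a):
    """Inverse operation: pass the residual through the synthesis filter 1/A(z)."""
    speech = []
    for n in range(len(residual)):
        prediction = sum(a[i - 1] * speech[n - i]
                         for i in range(1, len(a) + 1) if n - i >= 0)
        speech.append(residual[n] + prediction)
    return speech
```

With unquantized coefficients and an unquantized residual, `lp_synthesis(lp_residual(x, a), a)` returns the original samples; coding distortion enters only once the residual (the excitation) or the coefficients are quantized.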

Fig. 2.3 Unvoiced and voiced segments of (a) speech signal, (b) LP residual
(scaled by a factor of 2).

It has been shown (Kubin et al. 1993) that the LP model is capable of producing
very high quality speech for the unvoiced regions even if no bits are assigned to coding
the excitation vector (the excitation is generated as a series of independent Gaussian
random numbers). This is possible because the noise-like excitation of the unvoiced
speech contains little perceptually important information. The difficulty of coding
voiced speech at low bit rates stems from the fact that the human ear is particularly
sensitive to small changes in the speech periodicity. At low bit rates, the small changes
between consecutive pitch pulses are very hard to model with the small number of
bits available per coding update.

2.2 Linear Prediction Analysis 

The LP filter coefficients are determined from the speech signal using linear predic- 
tion techniques (Makhoul 1975, Markel and Gray 1976, Rabiner and Schafer 1978, 
Deller Jr. et al. 1993). The traditional auto-correlation and covariance methods of 
calculating the LP coefficients have new alternatives such as discrete all-pole mod- 
elling (El-Jaroudi and Makhoul 1991). Many improvements to the basic estimation 
methods of the LP coefficients are summarized by Paliwal and Kleijn (1995). 
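
For illustration, the autocorrelation method can be sketched with the Levinson-Durbin recursion. This is a minimal sketch under stated assumptions, not the analysis used in the thesis: windowing, bandwidth expansion, and other refinements are omitted, and the function names are hypothetical. The predictor convention is that of $A(z) = 1 - \sum_i a_i z^{-i}$.

```python
def autocorrelation(x, max_lag):
    """Return r[0..max_lag] with r[k] = sum_n x[n] * x[n-k]."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for a_1..a_N by the Levinson-Durbin recursion.

    Returns (coefficients, final prediction error energy).
    """
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] - sum(a[i] * r[m - i] for i in range(1, m))
        k = acc / err  # reflection coefficient of stage m
        new_a = a[:]
        new_a[m] = k
        for i in range(1, m):
            new_a[i] = a[i] - k * a[m - i]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err
```

The reflection coefficients `k` computed along the way are one of the alternative parameter representations used for coding the LP filter.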

The update rate for the LP coefficients is related to the characteristics of the 
vocal tract. Most of the time, the shape of the vocal tract changes relatively slowly 
in time. The vocal tract articulators usually move less than 1 cm at a time at speeds
up to 30 cm/s (O’Shaughnessy 1987). This translates into a change period of about
30 ms although during slow speech the shape of the vocal tract may not change for 
up to 200 ms. The vocal tract characteristics can also change rapidly, e.g., when the 
air-flow passage of the vocal tract closes or opens at the lips. The LP coefficients are 
calculated with update rates varying from 30 to 100 times per second (every 30 to 
10 ms). 
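
The timing figures above follow from simple arithmetic (a worked check added for illustration, using only the numbers quoted in the text):

```latex
\[
\frac{1\ \text{cm}}{30\ \text{cm/s}} \approx 33\ \text{ms},
\qquad
\frac{1}{30\ \text{s}^{-1}} \approx 33\ \text{ms},
\qquad
\frac{1}{100\ \text{s}^{-1}} = 10\ \text{ms}.
\]
```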

The number of calculated LP coefficients is related to the number of formants
present in the spectrum of the speech signal. The vocal tract imposes formant structure
on the glottal excitation with an average of one formant per 1 kHz. Each formant
is modelled by a pair of complex-conjugate poles, i.e., by two coefficients, so about 8
coefficients cover the 4 kHz bandwidth. A few additional coefficients are used to better
approximate the spectral valleys and general shape of the spectrum. The number of
calculated coefficients is often equal to 10-12 per update.

The LP coefficients are usually not coded directly but first transformed into a 
set of parameters which has desirable coding properties. Various representations of 
the LP coefficients have been proposed. Currently the most popular are Line Spec- 
tral Frequencies (LSF) also known as Line Spectral Pairs (LSP) (Soong and Juang 
1984). Other representations include reflection coefficients, log-area ratio, cepstral 
coefficients, and the LP filter impulse response (Rabiner and Schafer 1978).

Considerable work has been done in developing efficient quantization methods for 
the LP parameters. Scalar quantizers achieve transparent coding† at rates of 32 bits
per update with 50 updates per second, which results in a coding rate of 1.6 kb/s (Soong
and Juang 1993). Vector quantization techniques (Gray 1984, Makhoul et al. 1985,
Gersho and Gray 1992) have provided a means of transparent coding at rates as low
as 24 bits per update which, with 50 updates per second, results in a rate of 1.2 kb/s
(Paliwal and Atal 1993). Developments in LP analysis and coding are reviewed, for
example, by Paliwal and Kleijn (1995).

The LP model assumes independence between the excitation and the parameters 
of the linear filter so that separate analysis and interpolation of the LP parameters 
and the LP excitation can be performed. The LP parameters are often up-sampled 
(interpolated) to the rate of 200-400 parameter sets per second. The interpolation of 
the LP parameters in different domains has been studied recently by Paliwal (1995) 
with indication that the interpolation in the LSF domain has desirable properties. 
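
As an illustration of the up-sampling step, the sketch below interpolates between two consecutive LSF vectors. It assumes linear interpolation and four subframes per update (50 updates per second giving 200 parameter sets per second); the function name and the example values are hypothetical.

```python
def interpolate_lsf(prev_lsf, curr_lsf, n_subframes):
    """Linearly interpolate LSF vectors between two consecutive updates.

    Returns one LSF vector per subframe; the last one equals `curr_lsf`.
    """
    sets = []
    for s in range(1, n_subframes + 1):
        w = s / n_subframes
        sets.append([(1.0 - w) * p + w * c for p, c in zip(prev_lsf, curr_lsf)])
    return sets


# Two-dimensional toy vectors (real coders use 10-12 LSFs per update).
subframe_lsfs = interpolate_lsf([0.1, 0.2], [0.3, 0.6], n_subframes=4)
```

One reason the LSF domain is convenient here is that a weighted average of two increasing LSF vectors is itself increasing, so the interpolated sets remain validly ordered.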

Although major progress has been made in reducing the bit rate for encoding the 
LP parameters, the bit rate for the transparent encoding of the LP excitation still 
remains very high. A multitude of methods for representing the excitation signal 
have been proposed but the lack of an efficient representation of the excitation still 
remains a major obstacle in synthesizing high quality speech at low bit rates (Atal 
and Caspers 1991). We review a number of techniques which aim at improved coding 
of the LP excitation. 

2.3 Modelling the LP Excitation with Fixed-Length Analysis 

In this section we describe linear prediction analysis-by-synthesis (LPAS) coding. 
Although analysis-by-synthesis does not have to be performed with fixed analysis 
block lengths, it was first developed in such a context (Kroon and Deprettere 1988). 
Some coders which perform their analysis with pitch-synchronous block lengths, for 
example the WI coder discussed later, also employ elements of analysis-by-synthesis 
coding, i.e., error measurement with respect to the perceptually weighted speech. 
First, the LPAS coding is presented. Then, some of the proposed improvements 
in the representation of the LP excitation signal for Code-Excited Linear Prediction 
(CELP) coders are examined. Finally, the generalized analysis-by-synthesis paradigm
is introduced.

† Transparent coding of a parameter generally means that the coding of the parameter does not
introduce any perceptual distortion in the reconstructed signal.


2.3.1 Analysis-by-Synthesis 

In the analysis-by-synthesis procedure, the parameters describing the LP excitation
are determined by minimizing the perceptually weighted mean square error between 
the original and the reconstructed speech (Kroon and Deprettere 1988) (Fig. 2.4). 

The perceptual weighting exploits the masking properties of the human hearing 
system. The masking makes the noise in and near frequency bands of high energy less 
audible than the noise at the frequencies corresponding to the energy valleys. The 
perceptual weighting filter emphasizes the error in the spectral valleys of the input 
speech and deemphasizes the error in the regions of spectral peaks. As a result, the
quantization noise in the valleys is reduced and the noise near the peaks is increased.
This increased noise near the spectral peaks is masked by the human auditory system.
The perceptual weighting is often specified as a filter 


\[ W(z) = \frac{A(z/\gamma_1)}{A(z/\gamma_2)}, \qquad 0 < \gamma_1 < \gamma_2 < 1, \tag{2.3} \]

where $A(z)$ is the LP analysis filter as given in (2.2). The values $\gamma_1$ and $\gamma_2$ are fixed
or adaptive.

The analysis-by-synthesis approach is often called “closed loop” analysis as opposed
to “open loop” analysis in which the parameters are determined without reconstruc- 
tion of the speech signal. The “closed loop” analysis is usually computationally more 
expensive than the “open loop” approach. In practice those two are often combined. 
“Open loop” analysis provides a set of initial candidates for parameter representation 
and the “closed loop” analysis serves as a final criterion for selecting the best set 
of parameters. Many techniques for reducing the computational complexity of the 
LPAS coders have been reviewed by Kroon and Deprettere (1988), Gersho (1994), and
Kroon and Kleijn (1995).

In the improvements of the coded speech quality, the emphasis is on perceptually 
accurate representation of the periodicity of the coded LP excitation. Poor represen- 
tation of speech periodicity in the voiced regions is the main shortfall of LPAS coders 
operating at low bit rates. 

Fig. 2.4 Linear prediction analysis-by-synthesis coding.

2.3.2 Code-Excited Linear Prediction (CELP) 

In a CELP coder the LP excitation is modelled using vector quantization (VQ). The
vectors which represent the LP excitation are selected with the analysis-by-synthesis 
procedure. 

Stochastic Codebook 

The unvoiced part of the excitation is modelled in CELP by the so-called stochastic 
codebook. The stochastic codebook is also used to model the start and changes of 
the voiced excitation. The same fixed codebook is used at the transmitter and the 
receiver. The index to the selected codebook entry is transmitted. 

In early CELP the entries of the stochastic codebook were populated with independent
Gaussian random numbers. The search of such an unstructured codebook
necessitates a very high computational complexity. To reduce this complexity and to 
reduce the required storage space, a variety of structural constraints have been im- 
posed on the codebook. The proposed structures of the stochastic codebook include 
overlapped codebooks, sparse codebooks and algebraic codebooks (see for example 
Kleijn and Krasinski 1990, Gersho 1994). 

More than one codebook may be used to represent the unvoiced contribution; 
the configuration of multiple codebooks is often called multistage VQ (Gersho and 
Gray 1992). In multistage VQ, the excitation vector is generated as a sum of scaled 
entries from several codebooks which are sequentially searched. The sequential search 
of multiple codebooks is suboptimal and a joint search usually introduces excessive 
complexity. To approach the optimal selection, the orthogonalization of the multiple 
codebooks is used (Gerson and Jasiuk 1991b, Moreau and Dymarski 1994). 


Pitch Filter 


A simple model of the periodicity present in the LP residual can incorporate a pitch 
filter. The pitch synthesis filter is specified as 


\[ \frac{1}{P(z)} = \frac{1}{1 - \beta z^{-M}}, \tag{2.4} \]

where $\beta$ and $M$ are respectively the gain coefficient and the pitch lag. The lag $M$
approximates the periodicity (or the pitch period) of the signal. The gain $\beta$ can
be interpreted as an indicator of the “level of periodicity” with $\beta$ approaching the
value of 1 for “very periodic” signals. Although the parameters of the pitch filter
are determined via analysis-by-synthesis (“closed-loop”), the initial estimate of the 
parameters is often performed with “open-loop” methods. The properties of the pitch 
filters have been studied for example by Ramachandran and Kabal (1987, 1989). 
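
The behaviour of the single-tap pitch synthesis filter is easy to see in a short Python sketch. It assumes an integer lag, a fixed gain, and zero initial state; the function name is hypothetical.

```python
def pitch_synthesis(excitation, beta, lag):
    """Single-tap pitch synthesis filter 1/(1 - beta * z^-M).

    Implements y[n] = x[n] + beta * y[n - M] with zero initial state.
    """
    out = []
    for n, x in enumerate(excitation):
        past = out[n - lag] if n - lag >= 0 else 0.0
        out.append(x + beta * past)
    return out


# A single impulse is repeated every `lag` samples, scaled by beta each time.
impulse = [1.0] + [0.0] * 8
response = pitch_synthesis(impulse, beta=0.5, lag=4)
```

As the gain approaches 1 the repeated pulses decay more slowly, which matches the interpretation of the gain as an indicator of the level of periodicity.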

For voiced speech, the pitch period varies typically from 2 to 20 ms. For the 8 kHz 
sampling rate, to facilitate 7-bit encoding, the range of the delays is often limited 
from 20 to 147 samples (128 possible delays). The update rate of the pitch predictor 
parameters is higher than that of the LP parameters, typically 200 times per second 
(every 5 ms). The gain coefficient can be encoded with 3 to 4 bits per update, which 
means that the bit rate for coding only the pitch information is from 2 to 2.2 kb/s 
(higher than the coding rate of the LP parameters).
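
The quoted bit rate for the pitch information can be reproduced with a small calculation. The helper below is a hypothetical illustration that uses only the figures given in the text (delays of 20 to 147 samples, 3 to 4 gain bits, 200 updates per second).

```python
import math


def pitch_bit_rate(lag_min, lag_max, gain_bits, updates_per_second):
    """Bit rate (b/s) for coding one pitch lag and one gain per update."""
    n_lags = lag_max - lag_min + 1             # number of distinct delays
    lag_bits = math.ceil(math.log2(n_lags))    # bits needed to index a delay
    return (lag_bits + gain_bits) * updates_per_second


low = pitch_bit_rate(20, 147, gain_bits=3, updates_per_second=200)   # 2000 b/s
high = pitch_bit_rate(20, 147, gain_bits=4, updates_per_second=200)  # 2200 b/s
```

The range of 20 to 147 samples contains exactly 128 delays, which is why 7 bits suffice for the lag.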

Multi-tap pitch filters have also been suggested, with better reported performance
than the single-tap filter. Their disadvantage, however, is an even higher bit rate 
needed to encode the filter coefficients. Using vector quantization with 5-7 bits to 
code the coefficients of a three-tap filter results in the pitch-information coding rate 
of 2.4-2.8 kb/s.

Fractional Pitch 

In an important contribution to modelling the LP residual for voiced speech, the 
search of the pitch period is refined to a fraction of a sample (the analysis is per- 
formed with sub-sample resolution) (Kroon and Atal 1990, Marques et al. 1990). The 
fractional pitch is used as an alternative to multi-tap pitch filters. Although the frac- 
tional pitch increases the bit rate of the coded pitch information, the technique is 
now used in many CELP implementations. Most coders use a nonuniform spacing 
with higher resolution for shorter delays. No more than one or two additional bits 
per update are often used and the total bit rate is increased only by 200-400 b/s. 

Adaptive Codebook 

In most of the modern implementations of the CELP coder, the pitch filter is represented
as a codebook called the adaptive codebook (Kleijn et al. 1988a, b). In contrast
with the adaptive codebook, the stochastic codebook is often called the fixed code- 
book. A CELP coder with two codebooks, the adaptive and the fixed codebook, is 
shown in Fig. 2.5. 

In a sequential search, first every entry of the adaptive codebook is tried and the 
vector which minimizes the perceptually weighted error is selected. The optimal gain
for the selected vector is calculated. The adaptive codebook contribution, multiplied 
by the optimal gain, is used during the search for the fixed codebook entry. Again, the 
vector which minimizes the perceptually weighted error is selected and the optimal 
gain for the fixed codebook contribution is calculated. 
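
The sequential search described above can be sketched as follows. This is a simplified illustration, not the coder's implementation: it assumes that the target vector is already perceptually weighted and that every codebook entry has already been passed through the weighted synthesis filter, and all function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def best_entry(target, candidates):
    """Pick the candidate y and gain g minimizing ||target - g * y||^2.

    For the optimal gain g = <t, y> / <y, y>, minimizing the error is
    equivalent to maximizing <t, y>^2 / <y, y>.
    """
    best_index, best_gain, best_score = -1, 0.0, -1.0
    for index, y in enumerate(candidates):
        energy = dot(y, y)
        if energy == 0.0:
            continue
        corr = dot(target, y)
        score = corr * corr / energy
        if score > best_score:
            best_index, best_gain, best_score = index, corr / energy, score
    return best_index, best_gain


def sequential_search(target, adaptive_cb, fixed_cb):
    """Search the adaptive codebook first, then the fixed codebook."""
    ia, ga = best_entry(target, adaptive_cb)
    # Remove the scaled adaptive contribution before the fixed codebook search.
    new_target = [t - ga * y for t, y in zip(target, adaptive_cb[ia])]
    i_f, gf = best_entry(new_target, fixed_cb)
    return ia, ga, i_f, gf
```

Because the fixed codebook is searched only after the adaptive contribution has been fixed, the result is in general suboptimal compared with a joint search of both codebooks.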

The adaptive codebook can be interpreted as a generalization of a pitch filter. As
a special case, the codebook entries can be formed from the output of a pitch filter
applied to the past LP excitation. In the case of a single-tap pitch filter, the entries of
the adaptive codebook would be formed from the samples of the past excitation and
the gain of the codebook would be the coefficient of the pitch filter.

Fig. 2.5 CELP coder implementation using an adaptive and a fixed
codebook.

In the two-codebook CELP structure† of Fig. 2.5, controlling the periodicity of
the LP excitation can be achieved by (i) modifying the error weighting and thus 
influencing the selection of the adaptive and fixed codebook entries, (ii) controlling 
the relative gain of the two codebooks when forming the LP excitation, (iii) a specific 
way of forming the entries of the adaptive codebook and/or the fixed codebook. 

To better describe the various techniques used to improve the representation of 
the LP excitation, the operations performed in a CELP coder are classified into three
stages (Fig. 2.6): 

† As mentioned earlier, several codebooks are sometimes used to represent the stochastic
contribution, in which case the fixed codebook is implemented as more than one codebook.








(i) The entries of the codebooks are selected and gains of the codebooks are calcu- 
lated. 

(ii) The selected entries are combined to form the LP excitation and the coded 
speech is synthesized. The codebook gains can be used as calculated in stage (i) 
or they can be updated. 

(iii) The adaptive codebook is updated. 

The operations of stages (i) and (iii) are performed at the encoder. The task of the 
encoder is to code the parameters which are to be used in stage (ii). The decoder 
performs the operations of stages (ii) and (iii). 

Stage (ii) includes post-filtering for which parameters are not directly coded. The 
post-filter parameters are determined from the LP filter parameters, the gains of 
the codebooks, and the index of the adaptive codebook. Post-filtering is included 
in the same functional block as the LP synthesis filter because a pitch post-filter 
often precedes the LP synthesis filter and a formant post-filter usually follows the 
LP synthesis filter. We do not consider post-filtering as a method which attempts to 
refine the representation of the LP excitation. Post-filtering in many cases improves 
the quality of the reconstructed speech but it does not contribute to the modelling of 
the LP excitation. 

We now describe a number of methods which, by modifying the operations in 
stages (i), (ii) and (iii), aim to improve the modelling of the LP excitation. 

Harmonic Noise Weighting 

In harmonic noise weighting (Gerson and Jasiuk 1991a), a multi-tap pitch filter is used 
to further weight the perceptually weighted error. By changing the error weighting, 
the selection of the codebook entries is influenced. The calculated gains are also, 
in general, different. Harmonic noise weighting deemphasizes the error at pitch har- 
monics and the fixed codebook contribution is steered to better match the signal 
spectrum between the harmonics. The method takes advantage of the property of 
auditory masking which suggests that the noise between the harmonics is more au- 
dible than the noise on the harmonics. The gains calculated in stage (i) are used in 
stages (ii) and (iii). The method modifies stage (i) only. 

Fig. 2.6 The three stages of coding the LP excitation in a CELP coder:
(i) selection of the codebook entries and calculation of the gains, (ii) synthesis
of the coded signal, (iii) update of the adaptive codebook.

Constrained Excitation

In the constrained excitation technique (Shoham 1991), the fixed codebook contri- 
bution gain in stages (ii) and (iii) is different from the gain calculated in stage (i). 
The gain of the fixed codebook contribution used in (ii) and (iii) is reduced if the 
contribution of the adaptive codebook is determined to be large. Decreasing the gain 
of the fixed codebook component enhances the periodicity of the synthesized signal. 
Also, the adaptive codebook is updated with a signal which contains a smaller noise 
component. The method modifies only the fixed codebook gain in stages (ii) and (iii). 

Pitch Synchronous Innovation 

In the pitch synchronous innovation technique (Mano et al. 1995), the fixed codebook 
contribution is made more periodic based on the estimated pitch period. If the pitch 
period is shorter than the vectors in the fixed codebook, the entries of the codebook 
are modified. The new entries are formed by repeating pitch-period-sized blocks which 
are part of the old entries. The method increases the periodicity of the synthesized 
speech for pitch values shorter than the length of the fixed codebook vectors. The 
same fixed codebook change is applied in all three stages. The gains of the codebook 
contributions in stages (ii) and (iii) are not changed - they are as determined in 
stage (i). 
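
The modification of the fixed codebook entries can be illustrated with a short sketch. It assumes an integer pitch period and that the repeated block is taken from the start of the original entry; both choices, and the function name, are assumptions made for illustration rather than details given in the text.

```python
def pitch_synchronous_entry(entry, pitch_period):
    """Make a fixed codebook entry periodic with the given pitch period.

    If the pitch period is shorter than the entry, the first `pitch_period`
    samples are repeated to fill the vector; otherwise the entry is unchanged.
    """
    if pitch_period >= len(entry):
        return list(entry)
    block = entry[:pitch_period]
    return [block[n % pitch_period] for n in range(len(entry))]


original = [1, 2, 3, 4, 5, 6, 7]
periodic = pitch_synchronous_entry(original, pitch_period=3)    # [1, 2, 3, 1, 2, 3, 1]
unchanged = pitch_synchronous_entry(original, pitch_period=10)  # same as original
```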

Comb Filtering 

In comb filtering (Wang and Gersho 1990), an extra filter is inserted after the sum- 
mation of the adaptive and the fixed codebook contributions. The extra filter is used 
in all three stages. The filter proposed has the form 

(2.5)

with $\eta = 0.2$, $\gamma = 0.6$, and $\lambda = 0.001 \cdot F_0$, where $F_0$ is the estimated fundamental
frequency and $p$ is the determined pitch period. The filter is designed to suppress
the noise between pitch harmonics. As in the harmonic noise weighting and the pitch 
synchronous innovation, the gains of the codebook contributions used in stages (ii) 
and (iii) are as calculated in stage (i). 

Pitch Sharpening

In pitch sharpening (Taniguchi et al. 1991), the update procedure of the adaptive 
codebook is modified. The suggested modifications include reducing the fixed code- 
book gain and center-clipping the adaptive codebook vectors. The control over the 
signal periodicity is exercised only via changing the update of the adaptive codebook. 
The signal fed back to the adaptive codebook is different from the LP excitation used 
for synthesizing the reconstructed speech. The method modifies stage (iii) only. 

2.3.3 Generalized Analysis-by-Synthesis 

Generalized analysis-by-synthesis coding reduces the number of bits
required for encoding the pitch information (Kleijn et al. 1992, 1994). In generalized
LPAS the original speech signal is time-scale modified to facilitate infrequent
pitch updates. The modifications should be such that no perceptual distortion is
introduced. The update rate of the pitch information is typically reduced from 200 to
50 times per second and the intermediate pitch values are obtained by interpolation.
The number of bits needed for coding the pitch is cut by a factor of four. A number
of Relaxed-CELP (RCELP) coders have been implemented based on the generalized
LPAS paradigm (Kleijn et al. 1993, 1994, Nahumi and Kleijn 1995).
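
The reduced update rate can be illustrated with a sketch that recovers the intermediate pitch values from two transmitted ones. Linear interpolation and four subframes per update (50 updates per second restored to 200 values per second) are assumptions made for illustration; the text does not specify the interpolation rule, and the function name is hypothetical.

```python
def interpolate_pitch(prev_lag, curr_lag, n_sub=4):
    """Linearly interpolate the pitch lag over the subframes of one update.

    Returns `n_sub` values; the last one equals the transmitted `curr_lag`.
    """
    return [prev_lag + (curr_lag - prev_lag) * s / n_sub
            for s in range(1, n_sub + 1)]


# One transmitted value per 20 ms yields four values, one per 5 ms subframe.
lags = interpolate_pitch(40, 48)  # [42.0, 44.0, 46.0, 48.0]
```

Only one pitch value in four is transmitted, which is the source of the factor-of-four reduction in the pitch bit rate.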

2.4 Modelling the LP Excitation with Pitch-Synchronous 
Analysis 

Traditionally (e.g., in CELP) the LP excitation analysis has been carried out on 
a fixed-rate, fixed-analysis-block-length basis. In this approach the analysis-block 
boundaries are asynchronously imposed on the signal and some of the important 
features (pitch pulses) could be split into two separate blocks. 

In the case of speech produced by an idealized model, the voiced excitation is 
formed by a series of pitch (glottal) pulses which can be seen as separate entities. 
The analysis in this case should be pitch synchronous preferably in both the rate and 
the analysis block length (Fig. 2.7). One of the advantages of the pitch-synchronous 
analysis is that the periodicity of the signal can be controlled by moving individual 
pitch pulses. 

Fig. 2.7 Different types of the LP residual analysis. The numbers indicate
the length of analysis blocks. (a) Fixed-rate, fixed-block-length analysis.
(b) Fixed-rate, pitch-synchronous block-length analysis. (c) Pitch-synchronous-rate
and pitch-synchronous block-length analysis.


2.4.1 Glottal Coding 

In a group of LP based coders called the glottal coders the pitch (glottal) pulses 
are modelled directly by a fixed number of parameters (Hedelin 1986, Fujisaki and 
Ljungqvist 1986, Krishnamurthy 1992, Childers and Hu 1994). The pulses are iden- 
tified and the parameters are estimated in an “open loop” fashion from the identified 
pulses. The coding is based on individual pitch pulses and the analysis-by-synthesis 
procedure is not used. 

The glottal coders require a reliable detection of the boundaries of the glottal 
pulses. A number of algorithms for estimating the instant of glottal closure have
been proposed (Cheng and O’Shaughnessy 1989, Ma et al. 1994, Smits and Yegnanarayana
1995), but they are often not very reliable, particularly for noisy speech.

The glottal coders, in general, lack an adequate mechanism to represent the pitch 
pulse parameters in a way which maintains the perceptually “correct” periodicity 
of the coded speech. The noise-like component of the LP residual, for example, is 
modelled only for the unvoiced regions. 

2.4.2 Waveform Interpolation Coding

Waveform Interpolation (WI) coding also models individual pitch pulse waveforms. 
The analysis is pitch synchronous in block-length and the analysis-rate is fixed (see 
Fig. 2.7b). In the originally proposed Prototype Waveform Interpolation (PWI) (Kleijn
1991, 1993), waveforms of relatively distant pitch pulses are extracted and intermedi- 
ate pulses are interpolated from these prototypes. This approach did not fully utilize 
the actual intermediate pulses to identify appropriate prototypes. Special measures 
were taken to control the periodicity level of the coded speech and approximate time 
synchrony between the original and the reconstructed signal was maintained. In the 
Time-Frequency Interpolation (TFI) method (Shoham 1993b,a) the analysis is per- 
formed more often to improve tracking of the inter-pitch variations. The higher-rate 
analysis results in a better overall quality of the reconstructed speech. In the pro- 
posed improvements to the original PWI technique the prototype waveforms, called 
in WI the characteristic waveforms, are also extracted with a higher rate. They are 
additionally filtered to separate their periodic and noise components, called the slowly
evolving waveform (SEW) and the rapidly evolving waveform (REW). The SEW and
the REW are then coded separately (Kleijn and Haagen 1994b, 1995b). The pitch 
interpolation employed in WI coding is such that the time synchrony between the 
original and the reconstructed speech is not maintained. We write more about pitch 
interpolation and time synchrony in Chapter 4. 

The WI coder is sometimes classified as LPAS (Gersho 1994). Indeed, the weighted
error measurement used in coding the parameters of the WI model is usually performed
with respect to the original speech signal. The LPAS coders, however, are waveform
coders in the sense that the reconstructed signal converges to the original as the
quantization error decreases. This is not true in general for a WI coder, and hence WI is
often put into the class of parametric coders (for which the reconstructed signal does
not converge to the original even with decreasing quantization error).

2.4.3 The Pitch Pulse Evolution Model 

Observing successive pitch waveshapes, one can see an evolution, though the waveforms
are often obscured by noise components that tend to be different for every pitch pulse.
We have developed a Pitch Pulse Evolution (PPE) model (Stachurski and Kabal 1994)
to efficiently track the changes of the pitch waveforms. In the model, a canonical
waveshape is identified from a number of noisy pitch pulses; this canonical
waveform is called the underlying pitch pulse. The observed pulses are coded with
respect to the estimated underlying pulse.

The model consists of two parts. The voiced LP excitation is composed of a series 
of pitch pulses. The pitch pulse waveshapes evolve slowly and they may overlap if 
the lengths of the pulses are small enough. Superimposed on the pitch waveform is 
an unpredictable component — the unvoiced, noise-like part of the signal. We do not 
presuppose any particular shape for the pitch pulses, only that the waveshapes of the 
pitch pulses have some form of continuity from one instance to another. The periodic- 
ity of the reconstructed waveform is controlled by (i) adjusting the level of similarity 
between the consecutive pitch pulses, (ii) changing the amount of the superimposed 
noise, (iii) placing pitch pulses at encoded (and calculated) positions. 

The pitch pulse analysis in the PPE coder is pitch synchronous in analysis-block- 
length and analysis-rate (Fig. 2.7c). The analysis of the noise contribution is based on 
LPAS coding with fixed-block-length, fixed-rate analysis (Fig. 2.7a). The PPE coder 
maintains a relaxed time synchrony with the original signal. The selection of the 
optimal noise contribution is performed with analysis-by-synthesis with respect to the 
time-modified speech signal. In that sense the PPE coding is related to the generalized 
LPAS. More detailed formulation of the PPE method is given in Chapter 3. 

2.5 Summary 

In this chapter we have presented the basics of linear prediction analysis and the 
principles of the LP analysis-by-synthesis coding. A number of analysis methods and 
representations of the LP excitation have been discussed (including glottal coding 
which does not use analysis-by-synthesis). 

It was pointed out that the LP excitation is often modelled as consisting of two
parts: the voiced component and the unvoiced, noise-like component. The two components
are modelled differently:

1. The voiced component is represented in various techniques as: 

- the output of a pitch filter (single-tap filter, multi-tap filter, interpolation 
filter to accommodate fractional pitch), 

- an entry in an adaptive codebook (which can be created as the output
of a pitch filter with some additional filtering, for example as in pitch
sharpening),

- a glottal pulse described by a set of parameters, 

- a slowly evolving prototype waveform (SEW). 

2. The unvoiced component is represented as an entry in a stochastic codebook 
(or a sum of entries of several codebooks). 

Different coding techniques use various analysis rates and analysis-block lengths. 
CELP coders perform fixed-rate, fixed-block-length analysis. The WI uses fixed-rate 
analysis but the analysis-block lengths are pitch synchronous. The glottal coding 
analysis is pitch synchronous in the rate and the analysis-block length. 

In the context of the two-component representation, with fixed-rate analysis, the 
periodicity level of the reconstructed signal is controlled by (i) adjusting the similarity 
between the corresponding segments of the voiced component (e.g., pitch sharpening), 
(ii) changing the waveshapes of the vectors which are used to represent the unvoiced 
component (e.g., pitch synchronous innovation), (iii) varying the ratio between the 
voiced component and the unvoiced component (e.g., constrained excitation). When
the analysis is pitch synchronous, the periodicity level of the reconstructed signal can 
also be controlled by adjusting the relative positions of the consecutive pitch pulses 
which make up the voiced excitation. 

We have introduced a general idea behind the new proposed coding model, the 
Pitch Pulse Evolution (PPE) model. In the PPE model the unvoiced component 
is superimposed on the voiced part of the excitation. The periodicity of the reconstructed
signal is dependent on (i) the similarity level between the consecutive pitch
pulses, (ii) the amount of unvoiced component superimposed on the pulses, (iii) the 
positions of the pitch pulses. The LP residual analysis in the proposed PPE model is 
pitch synchronous in the rate and the analysis-block length. 

We would like to emphasize that the ability of a coder to control the periodicity 
of the reconstructed speech is not sufficient for good perceptual quality of the coded 
signal. In the same way as there is no objective measurement for perceptual equiv- 
alence between audio signals, there is no measure of the “correct” periodicity of the 
reconstructed signal. The emphasis is, therefore, shifted from controlling the periodicity
to the appropriate modelling of the original signal. The PPE method constructs
the LP excitation directly using the speech production model (the vocal tract excited 
with a series of similar pitch pulses) as a guide. In this sense the PPE coding is 
close to glottal coding. The way the LP residual is analyzed and the LP excitation is 
synthesized, however, makes the PPE method even more closely related to WI coding. 


Chapter 3 

The Pitch Pulse Evolution Model 


3.1 The PPE Concept 

In the LP coding, the voiced LP excitation represents the glottal excitation. The 
voiced LP excitation signal is composed of glottal pulses which are formed as the air 
is forced through the vocal folds into the speaker’s vocal tract. The glottal pulses are 
the result of the vibrations of the vocal folds and they are similar from one instance 
to another. 

The characteristics of the glottal excitation are reflected in the LP residual. One
can identify individual glottal (pitch) pulses which resemble each other. Since the
pitch pulses are not identical, the LP residual is often described as quasi-periodic.

We have no access to the “clean” pitch pulses of the voiced speech which correspond
to the glottal pulses as they emanate from the glottis. We recover the pulses
from the LP residual, assuming that the LP coefficients model all the remaining
components of the speech production system. This assumption, although proven to
be adequate in the context of speech coding, is nonetheless inaccurate. Moreover,
the LP residual resembles the true glottal excitation even less in the presence of
acoustic background noise. As a result, the observed pitch pulses, denoted as u, are
contaminated with noise and they may differ significantly from the glottal pulses. We
try to estimate the “clean” pitch pulses, written as v, from the noisy pitch pulses u
obtained from the LP residual. We call the estimated pulses v the underlying pulses
because, conceptually, they correspond to the glottal pulses which are at the basis of
voiced speech production.

In a vector representation of a pitch pulse, if a pulse lasts for 40 samples, the
corresponding pitch pulse vector is forty-dimensional (40-D). To compare pitch pulse 
vectors of different lengths, we pad the shorter vectors with zeros so that all the vectors 
have the same dimensionality. The dimensionality of the vectors is then equal to the 
dimension of the vector corresponding to the longest pitch pulse. The consecutive 
pitch pulses are alike and the similarity between the pulses translates into a relatively 
small error between the pitch pulse vectors. This fact has been used in practically 
every low-rate speech coder. The coding schemes employed for the voiced segments 
usually take advantage of the small difference between the consecutive pulses. In 
Fig. 3.1 we show a schematic representation of consecutive pitch pulses, which are 
portrayed as 2-D vectors. 
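
The zero padding and the error between pitch pulse vectors can be sketched as follows. The helper names and the sample values are illustrative assumptions, and the plain squared error is used here only as one possible error measure.

```python
def pad_to_common_length(vectors):
    """Zero-pad every vector to the length of the longest one."""
    dim = max(len(v) for v in vectors)
    return [list(v) + [0.0] * (dim - len(v)) for v in vectors]


def squared_error(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Two consecutive (made-up) pitch pulses of 3 and 4 samples.
pulses = [[1.0, 0.5, 0.2], [0.9, 0.6, 0.2, 0.1]]
padded = pad_to_common_length(pulses)
print(padded[0])                            # [1.0, 0.5, 0.2, 0.0]
print(squared_error(padded[0], padded[1]))  # small for similar pulses
```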

For the unvoiced speech the excitation of the vocal tract is generally random, and 
the error between consecutive blocks of the unvoiced LP residual is relatively large. 
A 2-D vector representation of consecutive blocks of unvoiced LP residual is depicted 
in Fig. 3.2. 

The traditional CELP approach to coding the error between consecutive pitch 
pulses is to code the orthogonal error between them (Fig. 3.3a). In the PPE coder, 
we first try to estimate the underlying pitch pulse and then code the difference between 
this calculated pulse and the observed, noisy pulses (Fig. 3.3b). 

As the vocal folds change their vibration characteristics, the underlying pitch 
pulse waveshape is not constant. The pitch pulse waveshapes change but they remain 
similar in their structure. In our methodology we call this change an evolution which 
led us to the name of our model, the pitch pulse evolution (PPE) model (Fig. 3.4). 

In the PPE model we decompose a noisy pitch pulse into the underlying pitch 
pulse and the superimposed noise (Fig. 3.5a). We estimate the underlying pulse from 
a noisy observation, which is the LP residual. With a series of the underlying pitch 
pulses we could determine how the pulses evolve (the drift of the pulses) and then 
predict the next underlying pulse. The pitch pulses of the LP excitation are created 
by adding the estimated noise to the estimated underlying pulses (Fig. 3.5b). 

We will now develop a general formulation of the PPE method. We adopt a
notation in which the vectors obtained in the process of estimation or prediction
are marked with the hat ($\hat{\ }$). The tilde ($\tilde{\ }$) marks coded vectors
available at both the transmitter and the receiver.

Fig. 3.1 Vector representation of pitch pulses for voiced LP residual. (a) Vector
representation of a series of pitch pulses. (b) Error between consecutive pitch pulses.

Fig. 3.2 Vector representation of unvoiced LP residual. (a) Vector representation
of a series of unvoiced blocks of LP residual. (b) Error between consecutive unvoiced
blocks of LP residual.

Fig. 3.3 The error between pitch pulses. (a) The orthogonal error between consecutive
pitch pulses. (b) The error between the underlying pitch pulse and the observed
noisy pulses.

Fig. 3.4 The underlying evolving pitch pulse v and the observed pulses u.

Fig. 3.5 The underlying pitch pulse and the noisy pulses. (a) Decomposition of a
noisy pitch pulse. (b) Adding noise to the underlying pitch pulse.

A vector of the LP residual corresponding to the pitch pulse found at the time
instant $i$ is denoted as $u^{(i)}$. The coded equivalent of the vector is marked as $\tilde{u}^{(i)}$.
The past coded pitch vectors corresponding to the slowly evolving pitch pulse shape
are written as $\tilde{v}^{(i-1)}, \tilde{v}^{(i-2)}, \ldots$.

The new pitch vector can be predicted from the past values of $\tilde{v}$ and $\tilde{u}$ according
to some prediction procedure $\mathcal{P}_p$:

$$\hat{v}_p^{(i)} = \mathcal{P}_p(\tilde{v}^{(i-1)}, \tilde{v}^{(i-2)}, \ldots, \tilde{u}^{(i-1)}, \tilde{u}^{(i-2)}, \ldots). \tag{3.1}$$

The same prediction can be performed in both the transmitter and the receiver with
the procedure $\mathcal{P}_p$ fixed or adaptive.

The transmitter has access to more information, namely the uncoded versions
of vectors $u$ (including the past, present and possibly the future ones to the extent
that delay is permissible) and the vectors $\hat{v}$ (unquantized past estimates). It can
therefore form a better estimate of the present value of $v$ according to some estimation
procedure $\mathcal{P}_e$:

$$\hat{v}_e^{(i)} = \mathcal{P}_e(\ldots, u^{(i+1)}, u^{(i)}, \ldots, \hat{v}_e^{(i-1)}, \hat{v}_e^{(i-2)}, \ldots). \tag{3.2}$$

Procedure $\mathcal{P}_e$ can also use the coded vectors $\tilde{u}$ and $\tilde{v}$ directly.

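We define a vector $d^{(i)}$ which represents the unpredicted drift of the pitch vector
and a vector $n^{(i)}$ which represents the unvoiced part of the vector $u^{(i)}$, so that

$$d^{(i)} = \hat{v}_e^{(i)} - \hat{v}_p^{(i)}, \tag{3.3}$$

$$n^{(i)} = u^{(i)} - \tilde{v}^{(i)}. \tag{3.4}$$

The quantized vectors $\tilde{d}$ and $\tilde{n}$ are used to form the coded underlying pulses $\tilde{v}$ and
the coded pulses $\tilde{u}$, which are assembled at the decoder into the LP excitation. We
have

$$\tilde{v}^{(i)} = \hat{v}_p^{(i)} + \tilde{d}^{(i)}, \tag{3.5}$$

$$\tilde{u}^{(i)} = \tilde{v}^{(i)} + \tilde{n}^{(i)}. \tag{3.6}$$

Note that with this formulation $n^{(i)}$ also accounts for the quantization noise of $d^{(i)}$.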
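
The relations (3.3) to (3.6) can be traced with a small sketch. It reads (3.3) as the difference between the estimated and the predicted pulse, and it uses a hypothetical uniform scalar quantizer (`quantize`) in place of the actual codebooks; the step sizes are arbitrary assumptions.

```python
def quantize(vec, step):
    """Hypothetical uniform scalar quantizer standing in for real codebooks."""
    return [step * round(x / step) for x in vec]


def code_pulse(u, v_pred, v_est, step_d=0.25, step_n=0.1):
    """Code one pitch pulse following (3.3)-(3.6)."""
    d = [e - p for e, p in zip(v_est, v_pred)]             # (3.3) drift
    d_q = quantize(d, step_d)                               # coded drift
    v_coded = [p + dq for p, dq in zip(v_pred, d_q)]        # (3.5)
    n = [x - vc for x, vc in zip(u, v_coded)]               # (3.4) unvoiced part
    n_q = quantize(n, step_n)                               # coded noise
    u_coded = [vc + nq for vc, nq in zip(v_coded, n_q)]     # (3.6)
    return d_q, n_q, v_coded, u_coded


u = [1.0, 0.4, -0.3]            # observed noisy pulse (made-up values)
v_pred = [0.8, 0.5, -0.2]       # predicted underlying pulse
v_est = [0.9, 0.45, -0.25]      # estimated underlying pulse
d_q, n_q, v_coded, u_coded = code_pulse(u, v_pred, v_est)
print(u_coded)
```

Because the noise vector is computed against the coded underlying pulse, the final error of the coded pulse depends only on the quantization of the noise vector in this sketch, however coarsely the drift is quantized.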

In general, the transmitter performs both operations: $\mathcal{P}_p$ and $\mathcal{P}_e$. The receiver
performs the prediction $\mathcal{P}_p$ and, based on the transmitted information, reconstructs an
approximation to the waveform.

Fig. 3.6 summarizes the notation. The diagram schematically displays the predicted,
the estimated and the observed vectors. The diagram does not show the
“true” underlying pulse $v$, which is unknowable.

    $\tilde{\ }$     coded vectors
    $\hat{\ }$       predicted / estimated vectors
    $u$              aligned, pitch-length residual vectors
    $v$              evolving pitch pulse
    $\hat{v}_p$      predicted pitch pulse
    $\hat{v}_e$      estimated pitch pulse
    $d$              drift of the pitch vector
    $n$              unvoiced part of the excitation

Fig. 3.6 Summary of the notation used in the PPE model.

Our PPE model includes the following features:

1. The approach does not sharply categorize speech into whether it is voiced or
unvoiced. The proportion of the two components changes with time as shown
schematically in Fig. 3.7. In fact, human speech production does not require
voicing to turn off before the unvoiced part of an utterance, and some sounds
(e.g. voiced fricatives such as /v/) require both voiced and unvoiced forms of
excitation. The pitch pulse waveform can be frozen, or adapted very slowly during
unvoiced segments. This means that a pitch pulse waveform is available for coding
the next voiced region and does not have to be built “from scratch” as in, for
example, the CELP coder.

2. We decompose the overall residual waveform into predictable and unpredictable
components for separate coding. The predictable part is formed by the underlying
pitch pulses and the unpredictable component is formed by the superimposed
noise.

Fig. 3.7 (a) A noisy speech signal. (b) Traditional voiced/unvoiced
division. (c) Voiced/unvoiced decomposition in the PPE model.



The transmitter predicts the present underlying pitch pulse waveform based on 
the past coded LP excitation. It also estimates the current underlying pitch 
pulse based on the LP residual with possible look ahead. It then transmits 
(i) the difference between the predicted and the estimated pulse, (ii) the in- 
formation about the pitch pulse positions, (iii) the unvoiced component of the 
excitation. 

The receiver predicts the present pitch pulse based on the past coded pulses 
in the same way as the transmitter does. It then forms the current underlying 
pulse based on the transmitted information about the pulse evolution. The 
receiver forms the LP excitation combining the coded underlying pulses and 
the unvoiced component of the excitation. 

3. The LP residual is regarded as a series of consecutive pitch pulses. The pitch 
pulse vectors are extracted from the LP residual in such a way that: 

(i) the combined pitch pulse vectors form the original LP residual, 

(ii) the error between the underlying pitch pulse vectors and the extracted 
vectors is minimized. 

The second condition implies that, with a relatively small drift of the evolving
underlying pitch pulse, the extracted pitch pulse vectors are aligned for
maximum correlation between each other. The extraction is performed with 
sub-sample resolution. 

For the purpose of estimating the underlying pitch pulses and coding the pulses, 
the vectors of the extracted pulses are padded with zeros to have the same 
dimensionality. 

4. The estimation of the voiced component of the LP excitation is realized in 
the process of the estimation of the underlying pitch pulses. We describe the 
following estimation methods: 

(i) linear filtering of the pulses extracted from the LP residual (filtering with 
fixed coefficients), 

(ii) maximum ratio combining of the extracted pulses (linear filtering with 
adaptive coefficients), 

(iii) error minimization between an underlying pitch pulse and a number of the 
extracted pulses (also linear filtering with adaptive coefficients), 

(iv) an algorithm which minimizes a weighted sum of the errors between 

- a series of the underlying pitch pulses, 

- the underlying and the extracted pulses. 

The underlying pitch pulse estimation can be performed either in the time or 
in the frequency domain. 

5. Pitch interpolation based on separate interpolation of the pitch pulse length 
and the pitch pulse waveshape is employed to effectively control the periodicity 
of the LP excitation. The interpolation of the pitch pulse waveshapes is per- 
formed on the underlying pitch pulse vectors avoiding interpolation of the noise 
component. 

6. The pulse shape can be decoupled from its final gain-scaled contribution to
the LP excitation. Our later formulations are based on the prediction and 
estimation of normalized signals. Separate quantization of the gain and the 
pulse shape is used. 

The PPE model also allows for a number of new approaches which may further
improve performance of a PPE coder. For example, the smooth evolution of the pitch 
pulses depends on smooth changes in the LP analysis parameters (if different pitch 
waveforms are processed by different LP filters, unnecessary pulse-to-pulse variations 
may occur). The LP analysis can be modified to minimize the error between the 
LP residual and the target pitch pulse waveform. This approach has been investigated 
already by Zad-Issa and Kabal (1997). 

3.2 Extraction of the Pitch Pulses 

In the PPE model we view the voiced LP residual as a series of pitch pulses. In 
general, the pulses in the series overlap so that each pulse is superimposed on the 
ringing tail of the previous pulses. Every pulse has an initial high energy which falls 
on top of a small energy signal of the tails of the past pulses. The tails of the previous 
pulses are buried in a relatively high energy of a new pulse. We consider the tails 
of the past pulses as part of the noise of the current pulse. We thus regard the 
LP residual as a series of concatenated pulses in which (except the first and the last 
pulse) the end of one pulse indicates the beginning of the next pulse. 

The extraction of the pulses is equivalent to segmenting the LP residual into pitch 
pulse vectors of varying length such that: 

1. The concatenated pitch pulse vectors form the original residual. 

2. The error between the underlying pitch pulses and the extracted vectors is 
minimized. 

An optimal solution is easy to formulate: 

1. Segment the LP residual into pitch pulse vectors. Use every possible combina- 
tion of valid pitch pulse lengths. 

2. For every segmentation: 

a) Estimate the underlying, evolving pitch pulses 

b) Calculate the error between the estimated underlying pitch pulse and the 
segmented-out pitch pulse vectors. 

3. Choose the segmentation for which the error calculated in (2) is minimized.

Unfortunately, the solution is as easy to formulate as it is difficult to implement. 
Firstly, verification of every segmentation of a block of LP residual can be compu- 
tationally extremely expensive. Secondly, every new sample of the residual could 
change the past, already determined “best” segmentation. 
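
The exhaustive search described above can be written down directly, which also makes its cost apparent. The sketch below is illustrative only: the function names are hypothetical, and the underlying pulse of each segmentation is approximated by the mean of its zero-padded pulse vectors, which is an assumption and not one of the estimation methods of Section 3.3.

```python
def segmentations(total_len, min_len, max_len):
    """Yield every list of pulse lengths in [min_len, max_len] that sums to total_len."""
    if total_len == 0:
        yield []
        return
    for first in range(min_len, min(max_len, total_len) + 1):
        for rest in segmentations(total_len - first, min_len, max_len):
            yield [first] + rest


def segmentation_error(residual, lengths):
    """Squared error of the zero-padded pulses against their mean pulse
    (a placeholder for a real underlying-pulse estimate)."""
    pulses, pos = [], 0
    for n in lengths:
        pulses.append(list(residual[pos:pos + n]))
        pos += n
    dim = max(lengths)
    padded = [p + [0.0] * (dim - len(p)) for p in pulses]
    mean = [sum(col) / len(padded) for col in zip(*padded)]
    return sum((x - m) ** 2 for p in padded for x, m in zip(p, mean))


def best_segmentation(residual, min_len, max_len):
    """Exhaustive search; raises ValueError if no valid segmentation exists."""
    return min(segmentations(len(residual), min_len, max_len),
               key=lambda lengths: segmentation_error(residual, lengths))


residual = [1.0, 0.25, 0.0] * 3          # made-up, perfectly periodic residual
print(best_segmentation(residual, 2, 4))  # [3, 3, 3]
```

The number of candidate segmentations grows exponentially with the block length, which is the first of the two difficulties noted above.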

To bring the computational complexity to an implementable level, two problems 
should be addressed: 

1. How to limit the number of allowable segmentations. 

2. How to simplify the estimation of the evolving pitch pulse to reduce its depen- 
dency on too many pitch pulse vectors. 

Those two problems are related in the sense that the more we limit the number of 
valid segmentations the more complex estimation of the underlying pitch pulse we 
can afford with a fixed computational complexity. Also the simpler the estimation of 
the underlying pitch pulse, the more possible segmentations we can verify. 

The specifics of the implementation of the pitch pulse extraction algorithm are left 
for Chapter 5. Here we will just outline the main ideas used in our implementation 
which directly deal with the problems presented above. 

The number of the valid segmentations is limited by means of the following: 

1. The block of the LP residual processed at a time is fixed to a reasonable length. 

2. The boundaries of the segments must lie in the specified proximity of the ex- 
pected beginnings/ends of pitch pulses determined based on the energy of the 
LP residual. 

3. The boundaries of the segments are determined sequentially with a limited 
inter-dependence between non-adjacent segments. 

The estimation of the underlying pitch pulse is simplified. To avoid confusion 
between the simplified estimation and the estimation described in the next section, 
we call the underlying pitch pulse obtained with the simplified estimation the model 
pulse. Conceptually, model pitch pulses correspond to the underlying pitch pulses. 
The current model pitch pulse is one of the following: 

1. The last pitch pulse vector extracted from the LP residual.

2. The next pitch pulse vector (one pulse look-ahead). 

3. The average of the last and the next pitch pulse vectors. 

4. The average of the past pitch pulse vectors. 

The current model pitch pulse is the one of the above four which minimizes the 
prediction error with respect to the current pitch pulse (current candidate pitch pulse 
vector).
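
The selection among the four candidate model pulses can be sketched as follows. The function names are hypothetical, all vectors are assumed to be zero-padded to a common length, and the prediction error is taken to be the plain squared error, which may differ from the measure used in the implementation of Chapter 5.

```python
def squared_error(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Element-wise average of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def choose_model_pulse(current, last, nxt, past):
    """Return (name, pulse) of the candidate closest to the current pulse."""
    candidates = [
        ("last", last),
        ("next", nxt),
        ("average of last and next", mean_vector([last, nxt])),
        ("average of past", mean_vector(past)),
    ]
    return min(candidates, key=lambda c: squared_error(c[1], current))


name, pulse = choose_model_pulse(
    current=[1.0, 0.0],
    last=[0.0, 0.0],
    nxt=[2.0, 0.0],
    past=[[0.0, 0.0], [0.0, 0.0]],
)
print(name, pulse)  # average of last and next [1.0, 0.0]
```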

These limitations and simplifications enabled us to implement a pitch pulse ex- 
traction algorithm suitable for our model. The algorithm is discussed in detail in 
Section 5.3. 

3.3 Estimation of the Evolving Pitch Pulse 

In Section 3.1 we wrote the consecutive, noisy pitch pulse vectors as $u^{(i)}$ and the
corresponding underlying pitch pulses as $v^{(i)}$. In Section 3.2 we outlined the requirements
imposed on the pitch pulse extraction procedure in which we identify
the noisy pitch pulses in the LP residual. The extraction has been presented as a segmentation
of the LP residual into a series of pitch pulses. The pitch pulse vectors $u^{(i)}$
are formed from the pitch pulse segments of the LP residual by padding with zeros,
so that all the vectors have the same dimensionality.

We simplify the notation by dropping the time index $i$ corresponding to the current
pulse. The superscript $(i-k)$ is replaced by a subscript $k$, and the parameter $k$ is
assumed to be in the range $1, \ldots, N$ (the number $N$ is the number of pulses used in
the estimation method being described). We have

$$u_k = u^{(i-k)} \quad \text{and} \quad v_k = v^{(i-k)}. \tag{3.7}$$

We separate the gain and the shape of each vector,

$$u_k = \mu_k \underline{u}_k, \quad v_k = \nu_k \underline{v}_k, \quad n_k = \alpha_k \underline{n}_k, \tag{3.8}$$

where vectors marked with an underscore are normalized with some normalizing function
$\mathcal{N}(\cdot)$. The vectors can be normalized so that (i) the energy of the vector is unity,
(ii) the average energy of the vector elements is unity (the energy of the vector is
then equal to the length of the vector), (iii) the maximum energy peak of the vector
is equal to unity. We have

$$\underline{u}_k = \mathcal{N}(u_k), \quad \underline{v}_k = \mathcal{N}(v_k) \quad \text{and} \quad \underline{n}_k = \mathcal{N}(n_k). \tag{3.9}$$
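
The three normalization options can be sketched as a single gain/shape split. The function name `split_gain_shape` and the mode labels are illustrative assumptions; the vector is assumed to be non-zero.

```python
import math


def split_gain_shape(vec, mode="energy"):
    """Return (gain, shape) with vec = gain * shape.

    mode "energy":  the shape has unit energy,
    mode "average": the shape has unit average energy per element,
    mode "peak":    the largest squared sample of the shape is one.
    """
    energy = sum(x * x for x in vec)
    if mode == "energy":
        gain = math.sqrt(energy)
    elif mode == "average":
        gain = math.sqrt(energy / len(vec))
    elif mode == "peak":
        gain = max(abs(x) for x in vec)
    else:
        raise ValueError("unknown normalization mode: " + mode)
    return gain, [x / gain for x in vec]


for mode in ("energy", "average", "peak"):
    print(mode, split_gain_shape([3.0, 4.0], mode))
```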

We describe four estimation methods: linear filtering, maximum ratio combining, 
noise error minimization and total error minimization. 

3.3.1 Linear Filtering 

This is the simplest type of the underlying pulse estimation. Given a set of $N$ vectors
$\underline{u}_k$, we calculate $\underline{v}$ as a normalized average of the vectors $\underline{u}_k$,

$$\underline{v} = \mathcal{N}\Big(\sum_{k=1}^{N} a_k \underline{u}_k\Big). \tag{3.10}$$

The filter coefficients $a_k$ are fixed. Although the estimation itself is simple,
choosing a good set of the linear filter coefficients $a_k$ is not.
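
A minimal sketch of the fixed-coefficient estimate follows. The function name and the coefficient values are illustrative assumptions; the unit-energy normalization is used for $\mathcal{N}(\cdot)$ and the pulse vectors are assumed to be zero-padded to a common length.

```python
import math


def linear_filter_estimate(pulses, coeffs):
    """Weighted sum of the pulse vectors with fixed coefficients,
    followed by normalization to unit energy."""
    dim = len(pulses[0])
    combined = [sum(a * u[j] for a, u in zip(coeffs, pulses))
                for j in range(dim)]
    norm = math.sqrt(sum(x * x for x in combined))
    return [x / norm for x in combined]


pulses = [[1.0, 0.0], [0.0, 1.0]]   # made-up pulse vectors
print(linear_filter_estimate(pulses, [0.5, 0.5]))
```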

In the filtering proposed for the estimation of the slowly evolving waveforms
(SEW) in the WI coding, the coefficients $a_k$ specify a 20 Hz low-pass filter (Kleijn
and Haagen 1994b). To allow a fast update of the pulses at the voiced onsets (characterized
by a relatively large change in the signal energy), the filtering is done on
the unnormalized vectors $u_k$, so that

$$\underline{v} = \mathcal{N}\Big(\sum_{k=1}^{N} a_k u_k\Big). \tag{3.11}$$

In the WI method the pitch pulses are extracted with a fixed rate. Depending
on the rate of extraction and the length of the pulses, the pulses may overlap or
some samples will not be considered a part of any pulse. There is a constant number
of pulses in every filtering operation as the pulses are taken from within a fixed-length
time span. In this case specifying $a_k$ as coefficients of a fixed-length low-pass
filter is reasonable. In the context of the PPE model the pulses are extracted pitch
synchronously so that within a constant time span we deal with a variable number of
pulses. In this case the linear filter has to have a variable number of taps. A different
set of the filter coefficients would have to be specified depending on the number of
pulses filtered.

We found that the linear filtering with fixed coefficients is not flexible enough to 
control the characteristics of the estimated underlying pulses. The coefficients which 
performed well on one set of pulses were often inadequate for another set and vice 
versa. One set of fixed coefficients seems too much of a compromise. 


3.3.2 Maximum Ratio Combining 

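In this section we estimate the underlying pitch pulse from a series of noisy pulses so
that the signal-to-noise ratio of the underlying pulse vector is maximized. We assume
that

(i) the underlying pitch pulse vector $\underline{v}$ is constant in $N$ consecutive noisy pitch
pulse vectors $u_k$,

(ii) each vector $u_k$ is a summation of the underlying pitch pulse and the noise
component,

$$u_k = \beta_k \underline{v} + \alpha_k \underline{n}_k, \tag{3.12}$$

(iii) the vectors $\underline{n}_1, \ldots, \underline{n}_N$ are orthogonal,

$$\underline{n}_i^T \underline{n}_j = \begin{cases} 0 & \text{for } i \neq j, \\ 1 & \text{for } i = j. \end{cases} \tag{3.13}$$

We estimate the pitch pulse $\underline{v}$ as a linear combination of $u_1, \ldots, u_N$. We write

$$\beta \underline{v} + \alpha \underline{n} = \sum_{k=1}^{N} a_k u_k, \tag{3.14}$$

and we want to choose $a_k$ so that the signal-to-noise ratio $\beta^2/\alpha^2$ is maximized.

From (3.14) and (3.12) we obtain

$$\beta \underline{v} + \alpha \underline{n} = \sum_{k=1}^{N} (a_k \beta_k)\, \underline{v} + \sum_{k=1}^{N} (a_k \alpha_k \underline{n}_k), \tag{3.15}$$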
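
so

$$\beta = \sum_{k=1}^{N} a_k \beta_k \tag{3.16}$$

and

$$\alpha \underline{n} = \sum_{k=1}^{N} a_k \alpha_k \underline{n}_k. \tag{3.17}$$

With (3.13) applied to (3.17) we have

$$\alpha^2 = \sum_{k=1}^{N} a_k^2 \alpha_k^2. \tag{3.18}$$

Now

$$\frac{\beta^2}{\alpha^2} = \frac{\Big(\sum_{k=1}^{N} a_k \beta_k\Big)^2}{\sum_{k=1}^{N} a_k^2 \alpha_k^2}. \tag{3.19}$$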

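We use the Schwarz inequality:

$$\Big(\sum_k x_k y_k\Big)^2 \le \sum_k x_k^2 \sum_k y_k^2. \tag{3.20}$$

Equality occurs only if the vectors formed by the values $x_k$ and $y_k$ are linearly dependent,
i.e., $x_k / y_k$ is constant. We identify

$$x_k = a_k \alpha_k \quad \text{and} \quad y_k = \frac{\beta_k}{\alpha_k}, \tag{3.21}$$

which gives

$$\Big(\sum_{k=1}^{N} a_k \beta_k\Big)^2 \le \sum_{k=1}^{N} a_k^2 \alpha_k^2 \sum_{k=1}^{N} \frac{\beta_k^2}{\alpha_k^2}. \tag{3.22}$$

From (3.19) and (3.22) we obtain

$$\frac{\beta^2}{\alpha^2} \le \sum_{k=1}^{N} \frac{\beta_k^2}{\alpha_k^2}, \tag{3.23}$$

with equality only if

$$\frac{a_k \alpha_k}{\beta_k / \alpha_k} = \text{const}, \tag{3.24}$$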

which is satisfied for

$$a_k = \frac{\beta_k}{\alpha_k^2}. \tag{3.25}$$

We form the approximation of $\underline{v}$ such that

$$\underline{v} = \mathcal{N}\Big(\sum_{k=1}^{N} \frac{\beta_k}{\alpha_k^2}\, u_k\Big). \tag{3.26}$$

The values $\beta_k$ and $\alpha_k$ can be calculated as

$$\beta_k = u_k^T \hat{\underline{v}} \quad \text{and} \quad \alpha_k = \sqrt{|u_k|^2 - \beta_k^2} \tag{3.27}$$

with respect to the last estimated underlying pitch pulse vector $\hat{\underline{v}}$. This is adaptive
linear filtering with the coefficients $a_k$ customized for every set of noisy pitch pulse
vectors $u_k$.
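
The combining rule can be sketched as follows. The function names are hypothetical, the previous estimate is assumed to have unit energy, and the small floor on the noise energy is an added safeguard against division by zero that is not part of the derivation.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mrc_estimate(pulses, v_prev, floor=1e-12):
    """Maximum ratio combining of noisy pulses.

    v_prev is the last estimated underlying pulse (unit energy), used to
    measure the signal gain and the noise energy of every pulse.
    """
    combined = [0.0] * len(v_prev)
    for u in pulses:
        beta = dot(u, v_prev)                               # signal gain
        alpha_sq = max(dot(u, u) - beta * beta, floor)      # noise energy
        a = beta / alpha_sq                                 # combining weight
        combined = [c + a * x for c, x in zip(combined, u)]
    norm = math.sqrt(dot(combined, combined))
    return [c / norm for c in combined]


pulses = [[1.0, 0.1], [1.0, -0.1]]   # made-up noisy pulses
print(mrc_estimate(pulses, [1.0, 0.0]))
```

The weights depend strongly on the estimated noise energy: a pulse that happens to match the previous estimate closely receives a very large weight.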

Although this method seemed very promising at the beginning, we soon found that
inaccurate estimation of the ratio $\beta_k/\alpha_k^2$ may lead to a very poor estimate of the
underlying pulse $\underline{v}$, especially at times when the underlying pulse changes. Occasionally
the error between the estimated pitch pulse and the extracted pitch pulse would be
larger than the error between consecutive extracted pulses.


3.3.3 Noise Error Minimization 

In this section an estimation based on the minimization of the energy of the noise
component is examined. We assume that

(i) the underlying pitch pulse $\underline{v}$ is constant for $N$ consecutive noisy pulses $u_k$,

(ii) each vector $u_k$ is a summation of the underlying pitch pulse and the noise
component,

$$u_k = \beta_k \underline{v} + \alpha_k \underline{n}_k, \tag{3.28}$$

(iii) the noise vectors $\underline{n}_1, \ldots, \underline{n}_N$ are orthogonal to the underlying pitch pulse $\underline{v}$,

$$\underline{n}_k^T \underline{v} = 0 \quad \text{for } k = 1, \ldots, N. \tag{3.29}$$



We want to find the underlying pitch pulse $\underline{v}$ such that the sum of the noise energies
$\sum_k \alpha_k^2$ is minimized.

From (3.28) and (3.29) we have

$$\beta_k = u_k^T \underline{v}, \tag{3.30}$$

$$\alpha_k = u_k^T \underline{n}_k, \tag{3.31}$$

$$u_k^T u_k = \beta_k u_k^T \underline{v} + \alpha_k u_k^T \underline{n}_k. \tag{3.32}$$

The noise energy is given by

$$\alpha_k^2 = |u_k|^2 - \beta_k^2, \tag{3.33}$$

so that minimization of the sum $\sum_k \alpha_k^2$ is equivalent to maximization of the sum $\sum_k \beta_k^2$.

We write 

U = [ui •••••Ujv] and b = [/3i •• •• •• p^] T . (3.34) 

In this matrix notation, equation (3.29) becomes 


b = U T v . 


(3.35) 


We want to solve 


max \\U T v || , 
IISUK 


(3.36) 


which is the L 2 norm or maximum singular value of U T , cr L . Vector j> is the first 
right singular vector of U T corresponding to the singular value crp* 

We introduce normalization and weighting of the vectors ttfc. The former deem- 
phasizes vectors with larger energy (they may have a strong noise component), while 
the latter assigns more importance to the vectors closest to the estimation instance 
(these vectors may be a better approximation of the current vector n). Now 


v — argmax || W U T v || , 

l|i>n=i 


(3.37) 


t Vector v_ is also the eigenvector corresponding to the largest eigenvalue of the matrix UU T . 




where matrix $W$ has weighting coefficients on its diagonal and zeros elsewhere.

This estimation method can be seen as linear filtering with adaptive coefficients.
In fact, from the Singular Value Decomposition (SVD) theory (Golub and Loan 1989)
we have

$$ U^T \mathbf{v} = \sigma_1 \mathbf{z} \quad \text{and} \quad U \mathbf{z} = \sigma_1 \mathbf{v} . \tag{3.38} $$

The vectors $\mathbf{z}$ and $\mathbf{v}$ are, respectively, the left and the right singular vector
(corresponding to the first singular value $\sigma_1$) of the matrix $U^T$. Writing
$\mathbf{a} = \mathbf{z}/\sigma_1$, we have

$$ \mathbf{v} = U \mathbf{a} \tag{3.39} $$

$$ \phantom{\mathbf{v}} = \sum_{k=1}^{N} a_k \mathbf{u}_k . \tag{3.40} $$

The pitch pulse $\mathbf{v}$ is then a linear combination of vectors $\mathbf{u}_1, \ldots, \mathbf{u}_N$ (or a linear
combination of the weighted normalized vectors $w_1 \mathbf{u}_1, \ldots, w_N \mathbf{u}_N$). The coefficients
of the adaptive linear filter $a_k$ are generated, as in Section 3.3.2, for every set of
vectors $\mathbf{u}_k$.
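
The noise error minimization above can be sketched with NumPy's SVD routine. This is a minimal sketch under stated assumptions: the function name `svd_estimate`, the per-column normalization, and the sign convention (the estimate is made to correlate positively with the sum of the weighted pulses) are choices made for this example and are not taken from the text.

```python
import numpy as np


def svd_estimate(U, weights=None):
    """Estimate the underlying pulse as the dominant singular direction.

    U       : (L, N) array, one noisy pitch pulse per column
    weights : optional (N,) weights w_k (the diagonal of W); defaults to ones
    Returns the unit-norm estimate v and coefficients a with v = Uw @ a,
    where Uw holds the weighted, normalized pulses.
    """
    U = np.asarray(U, dtype=float)
    Un = U / np.linalg.norm(U, axis=0)        # normalize every pulse
    if weights is None:
        weights = np.ones(U.shape[1])
    Uw = Un * np.asarray(weights, dtype=float)  # column k scaled by w_k
    left, s, right_t = np.linalg.svd(Uw, full_matrices=False)
    v = left[:, 0]          # first right singular vector of Uw^T
    a = right_t[0] / s[0]   # a = z / sigma_1, so that Uw @ a == v
    if v @ Uw.sum(axis=1) < 0:   # resolve the sign ambiguity of the SVD
        v, a = -v, -a
    return v, a
```

The returned coefficients `a` make the linear-filtering interpretation explicit: the estimate is reproduced exactly by the weighted sum of the (normalized) noisy pulses.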

This method guarantees the smallest error between the underlying pitch pulse and 
a set of the extracted noisy pulses under the assumption that the underlying pulse is 
constant within the estimation interval. The assumption of a constant pulse results 
in a lack of control over the error between the consecutive estimated underlying pitch 
pulses. We use the insights gained in this section to develop a more general estimation 
procedure, which is presented in the next section. 

3.3.4 Total Error Minimization 

In the estimation types described so far, we did not consider the error between the
consecutive underlying pitch pulses. We estimated one pitch pulse at a time, assuming,
in Section 3.3.2 and Section 3.3.3, that the pulse $\mathbf{v}$ is constant for $N$ vectors
$\mathbf{u}_1, \ldots, \mathbf{u}_N$. In this section we estimate a series of underlying pitch pulses while
simultaneously controlling the error between them.

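For a series of $N$ noisy pitch pulse vectors $\mathbf{u}_k$ and the corresponding underlying
pitch pulses $\mathbf{v}_k$, we write the evolution of the pulses as

$$ \mathbf{v}_k = \gamma_k \mathbf{v}_{k-1} + \sigma_k \mathbf{d}_k \tag{3.41} $$

and

$$ \mathbf{u}_k = \beta_k \mathbf{v}_k + \alpha_k \mathbf{n}_k . \tag{3.42} $$

We assume that

$$ \mathbf{d}_k^T \mathbf{v}_{k-1} = 0 \quad \text{for } k = 1, \ldots, N \tag{3.43} $$

and

$$ \mathbf{n}_k^T \mathbf{v}_k = 0 \quad \text{for } k = 1, \ldots, N . \tag{3.44} $$

We have

$$ \gamma_k = \mathbf{v}_k^T \mathbf{v}_{k-1} , \quad \sigma_k^2 = 1 - \gamma_k^2 \tag{3.45} $$

and

$$ \beta_k = \mathbf{u}_k^T \mathbf{v}_k , \quad \alpha_k^2 = 1 - \beta_k^2 . \tag{3.46} $$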

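We want to find a set of $N$ vectors $\mathbf{v}_k$ which will minimize the total error

$$ e_t = \sum_{k=1}^{N} \big( \omega \sigma_k^2 + (1 - \omega) \alpha_k^2 \big) . \tag{3.47} $$

The weight $\omega \in (0, 1)$ determines the relative importance of the errors $\mathbf{d}_k$ and $\mathbf{n}_k$.
Starting with a set of $N$ vectors $\mathbf{v}_k$ written as $\{\mathbf{v}_k\}$, we refine one vector $\mathbf{v}_i$ at a
time so that the error $e_t$ is reduced. The influence of a vector $\mathbf{v}_i$ on the error $e_t$ can
be expressed as

$$ e(\mathbf{v}_i) = \omega \sigma_i^2 + (1 - \omega) \alpha_i^2 + \omega \sigma_{i+1}^2 , \tag{3.48} $$

with $\sigma_i^2$, $\alpha_i^2$ and $\sigma_{i+1}^2$ calculated as in (3.45) and (3.46).

The vector $\mathbf{v}_i^{(\min)}$ which minimizes $e(\mathbf{v}_i)$ can be calculated with SVD applied to
the weighted vectors $\mathbf{v}_{i-1}$, $\mathbf{u}_i$, $\mathbf{v}_{i+1}$ (compare with Section 3.3.3). Let $\{\mathbf{v}_k\}^{(i)}$ denote
the set $\{\mathbf{v}_k\}$ in which vector $\mathbf{v}_i$ is replaced with vector $\mathbf{v}_i^{(\min)}$. Since the error
$e(\mathbf{v}_i^{(\min)})$ is smaller than or equal to the error $e(\mathbf{v}_i)$, the total error of the new set
$\{\mathbf{v}_k\}^{(i)}$ is smaller than or equal to the total error of the set $\{\mathbf{v}_k\}$.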
The iterative method is applied as follows:

1. Start with an initial set of $N$ vectors $\{\mathbf{v}_k^{(0)}\}$. Calculate the initial error $e_t^{(0)}$
   and set $l$, the number of iterations, to 1.

2. If the vector $\mathbf{v}_0$ is known:

   - Find a new vector $\mathbf{v}_1^{(l)}$ which will minimize the sum of its weighted errors
     with respect to vectors $\mathbf{v}_0$, $\mathbf{u}_1$ and $\mathbf{v}_2^{(l-1)}$.

   Otherwise:

   - Find the vector $\mathbf{v}_1^{(l)}$ which will minimize the sum of its weighted errors
     with respect to vectors $\mathbf{u}_1$ and $\mathbf{v}_2^{(l-1)}$.

3. For every $i = 2, \ldots, N-1$, find a new vector $\mathbf{v}_i^{(l)}$ which will minimize the sum
   of its weighted errors with respect to vectors $\mathbf{v}_{i-1}^{(l)}$, $\mathbf{u}_i$ and $\mathbf{v}_{i+1}^{(l-1)}$.

4. Find a new vector $\mathbf{v}_N^{(l)}$ which will minimize the sum of its weighted errors with
   respect to vectors $\mathbf{v}_{N-1}^{(l)}$ and $\mathbf{u}_N$.

5. Calculate the error $e_t^{(l)}$ and compare it with $e_t^{(l-1)}$. If the difference is larger
   than a specified threshold and the number of iterations is smaller than a given
   maximum, increment $l$ and repeat starting with step (2).

6. Set the estimated underlying pitch pulses as:

$$ \mathbf{v}_1 = \mathbf{v}_1^{(l)} , \; \ldots , \; \mathbf{v}_N = \mathbf{v}_N^{(l)} . \tag{3.49} $$

When the new vectors are found using SVD, the above algorithm converges to a
set of pulses $\{\mathbf{v}_k\}$ which corresponds to a fixed point or stationary point with respect
to the iteration operation. The underlying pitch pulses $\{\mathbf{v}_k\}$ depend on the initial
conditions, i.e., the initial set of pulses $\{\mathbf{v}_k^{(0)}\}$. We have obtained good results with
the vectors $\{\mathbf{v}_k^{(0)}\}$ set to the noisy pitch pulse vectors $\{\mathbf{u}_k\}$.
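
The iterative procedure above can be sketched as follows. This is a hedged illustration rather than the thesis code: the names `estimate_underlying` and `total_error` are hypothetical, the vector $\mathbf{v}_0$ is assumed unknown (so the drift term for the first pulse is omitted), the weights enter the SVD as square roots because the singular value problem maximizes a sum of squared, weighted projections, and the sign of each new vector is chosen to correlate positively with its noisy pulse.

```python
import numpy as np


def total_error(V, U, omega):
    """Weighted total error: omega * drift energy + (1 - omega) * noise energy."""
    gamma = np.sum(V[:, 1:] * V[:, :-1], axis=0)   # v_k^T v_{k-1}, k = 2..N
    beta = np.sum(U * V, axis=0)                   # u_k^T v_k
    return omega * np.sum(1.0 - gamma**2) + (1.0 - omega) * np.sum(1.0 - beta**2)


def _dominant_direction(vectors, weights, reference):
    """Unit vector maximizing the weighted sum of squared projections."""
    A = np.column_stack(vectors) * np.sqrt(np.asarray(weights))
    v = np.linalg.svd(A, full_matrices=False)[0][:, 0]
    return v if v @ reference >= 0 else -v


def estimate_underlying(U, omega, tol=1e-6, max_iter=100):
    """Refine one underlying pulse at a time until the error stops decreasing."""
    U = np.asarray(U, dtype=float)
    U = U / np.linalg.norm(U, axis=0)   # work with normalized pulses
    N = U.shape[1]
    V = U.copy()                        # initial set: the noisy pulses
    err = total_error(V, U, omega)
    for _ in range(max_iter):
        for i in range(N):
            vecs, w = [U[:, i]], [1.0 - omega]
            if i > 0:                   # already updated in this pass
                vecs.append(V[:, i - 1]); w.append(omega)
            if i < N - 1:               # still from the previous pass
                vecs.append(V[:, i + 1]); w.append(omega)
            V[:, i] = _dominant_direction(vecs, w, U[:, i])
        new_err = total_error(V, U, omega)
        converged = err - new_err < tol
        err = new_err
        if converged:
            break
    return V, err
```

Each inner update can only lower (or keep) the terms of the error that involve the refined vector, so the total error is non-increasing from pass to pass; the stopping rule mirrors step 5.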

Performance of the Estimation Algorithm 

The prediction error between the noisy pulses $\mathbf{u}_k$ is written as

$$ \varepsilon_k^2 = 1 - \big( \mathbf{u}_k^T \mathbf{u}_{k-1} \big)^2 . \tag{3.50} $$

We specify the estimation gain as the log of the ratio between the sum of the error
energies $\varepsilon_k^2$ and the sum of the error energies $\sigma_k^2 + \alpha_k^2$,

$$ G_e = 10 \log_{10} \frac{ \sum_{k=1}^{N} \varepsilon_k^2 }{ \sum_{k=1}^{N} \big( \sigma_k^2 + \alpha_k^2 \big) } . \tag{3.51} $$

In a coder in which the differences between the consecutive noisy pitch pulses are
coded directly (e.g., CELP in its basic configuration), the error with the energy
$\varepsilon_k^2$ is coded. A positive value of $G_e$ indicates an improvement over this basic
approach.
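
A minimal sketch of the gain computation follows. The function name `estimation_gain_db` is hypothetical, the pulses are assumed to be unit-norm columns, and the terms that would need a pulse before the first one are simply left out of the sums; these are assumptions of the example.

```python
import numpy as np


def estimation_gain_db(U, V):
    """Estimation gain in dB for noisy pulses U and underlying pulses V.

    U, V : (L, N) arrays with unit-norm columns.
    """
    eps2 = 1.0 - np.sum(U[:, 1:] * U[:, :-1], axis=0) ** 2    # noisy-to-noisy prediction error
    sigma2 = 1.0 - np.sum(V[:, 1:] * V[:, :-1], axis=0) ** 2  # drift between underlying pulses
    alpha2 = 1.0 - np.sum(U * V, axis=0) ** 2                 # noise around the underlying pulses
    return 10.0 * np.log10(eps2.sum() / (sigma2.sum() + alpha2.sum()))
```

With the underlying pulses set equal to the noisy pulses, the noise term vanishes and the drift equals the prediction error, so the gain is 0 dB; this matches the role of basic CELP as the reference.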

An example of the estimation of the underlying pitch pulses is presented in
Fig. 3.8 - Fig. 3.11. In the experiment individual pulses of the LP residual are
identified (Fig. 3.8), and the estimation algorithm, with different values of the error
weight $\omega$, is applied to the noisy pitch pulses (Fig. 3.9 - Fig. 3.11). A total
of 22 pulses is identified in the LP residual of the word “figure”. The estimation
algorithm was performed on the normalized pulses. The initial set of the underlying
pulses is set to the extracted normalized pulses of the LP residual.



Fig. 3.8 The LP residual of the word “figure” with identified pitch 
pulses (22 pulses). The normalized, aligned pitch pulses of this residual 
are used in the example of the underlying pitch pulse estimation. 


Fig. 3.9 Estimation of the underlying pitch pulses with the error weight
$\omega = 0.5$. (a) Pitch pulses extracted from the LP residual (noisy pitch
pulses $\mathbf{u}$). (b) The estimated underlying pulses $\mathbf{v}$. (c) The error between
the underlying pulses and the noisy pulses $\mathbf{n}$. (d) The prediction error
between the noisy pulses, the prediction error between the underlying
pulses and the orthogonal error between the underlying and the noisy
pulses.

Fig. 3.10 Estimation of the underlying pitch pulses with the error
weight $\omega = 0.9$. (a) The noisy pitch pulses of the LP residual. (b) The
estimated underlying pulses. (c) The error between the underlying pulses
and the noisy pulses. (d) The prediction error between the noisy pulses,
the prediction error between the underlying pulses and the orthogonal
error between the underlying and the noisy pulses.

Fig. 3.11 Estimation of the underlying pitch pulses for different values
of the error weight $\omega$. (a) The estimated underlying pitch pulses $\mathbf{v}$.
(b) The error between the underlying pulses and the noisy pulses $\mathbf{n}$.
(c) The error between the consecutive underlying pitch pulses $\mathbf{d}$.

Fig. 3.9 shows the waveforms and the errors between the waveforms obtained using
the estimation algorithm with the error weight $\omega$ equal to 0.5. The decomposition
of the noisy pitch pulses into (i) the underlying pitch pulses and (ii) the noise (the error
between the underlying and the noisy pulses) is such that (i) the energy of the error
between the consecutive underlying pitch pulse vectors and (ii) the energy of the noise
vectors are approximately equal. The estimation gain $G_e$ is equal to 3.51 dB. For
$\omega = 0.5$ the estimation gain $G_e$ is maximum because in the formula used for calculating
$G_e$ the error energies $\sigma_k^2$ and $\alpha_k^2$ are added with equal weights.

Fig. 3.10 shows the waveforms obtained with the error weight $\omega = 0.9$. The
estimated underlying pitch pulses are very similar (the error between the consecutive
pulses is almost equal to zero) but the error between the underlying and the noisy
pitch pulse vectors is high. In this case the estimation gain $G_e$ is only 0.53 dB.

The shift of the error energy from the error between the noisy pulses onto the
error between the underlying pulses is presented in Fig. 3.11. For $\omega = 0.0$ the error
between the underlying pulses and the noisy pulses is zero, which means that the
underlying pulses are equal to the noisy pulses. The error between the consecutive
underlying pitch pulses is equal to the error between the consecutive noisy pulses.
This is the error which is coded in the basic-configuration CELP. With increasing
$\omega$ the consecutive underlying pitch pulses become more and more similar to one
another, but the error between the underlying pulses and the noisy pulses increases.
When $\omega = 1.0$ there is one constant underlying pitch pulse for all the 22 noisy pulses
(the single underlying pulse is obtained by applying SVD to the 22 noisy pulses).
One can observe a large error between the underlying pulses and the noisy pulses.
While for $\omega = 0.9$ the estimation gain $G_e$ is 0.53 dB, for $\omega = 1.0$ the gain $G_e$ is equal
to −1.87 dB. This means that the error between the underlying and the noisy pulses is
larger than the original error between the noisy pitch pulses.

In general, the smaller the weight $\omega$, the smaller the error between the underlying
and the noisy pulses, but the less similar the consecutive underlying pulses are to one
another. The larger the weight $\omega$, the more similar the underlying pulses, at
the cost of increased error between the underlying and the noisy pulses. For $\omega$ in the
proximity of 0.5 the estimation algorithm reasonably decomposes the error between
the noisy pulses into (i) the error between the underlying pulses (the drift between
the pulses) and (ii) the error between the underlying and the noisy pulses.

Estimation Using Weighted Average

Using SVD for the error minimization in steps (2)-(4) of the estimation algorithm
guarantees convergence of the estimated underlying pulses to a local minimum.
SVD is, however, computationally expensive. With only three vectors involved, and given
that the vectors are normalized and relatively well correlated with each other,
SVD can be approximated with a weighted average of the vectors. Fig. 3.12 shows the
underlying pitch pulses obtained with the estimation algorithms which used the SVD
and the weighted average error minimization. There is almost no difference between
the underlying pulses obtained with the two methods for $\omega = 0.5$. Even for $\omega = 0.1$
and $\omega = 0.9$, the differences between the two sets of the underlying pulses are very
small.
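
The text does not spell out the averaging weights, so the sketch below makes an assumption: the noisy pulse is weighted by $1 - \omega$ and each neighbouring underlying pulse by $\omega$, which is what one step of the SVD maximization reduces to when all projections are close to one. The function names are hypothetical; `svd_update` is included only so that the two updates can be compared.

```python
import numpy as np


def weighted_average_update(u, omega, v_prev=None, v_next=None):
    """Approximate the SVD update by a normalized weighted average (assumed weights)."""
    acc = (1.0 - omega) * np.asarray(u, dtype=float)
    for neighbour in (v_prev, v_next):
        if neighbour is not None:
            acc = acc + omega * np.asarray(neighbour, dtype=float)
    return acc / np.linalg.norm(acc)


def svd_update(u, omega, v_prev=None, v_next=None):
    """Reference update: dominant singular direction of the weighted vectors."""
    cols, w = [np.asarray(u, dtype=float)], [1.0 - omega]
    for neighbour in (v_prev, v_next):
        if neighbour is not None:
            cols.append(np.asarray(neighbour, dtype=float))
            w.append(omega)
    A = np.column_stack(cols) * np.sqrt(np.array(w))
    v = np.linalg.svd(A, full_matrices=False)[0][:, 0]
    return v if v @ cols[0] >= 0 else -v
```

The average needs only a few multiply-adds per sample, whereas the SVD of an L-by-3 matrix is considerably more work; for well-correlated unit vectors the two directions nearly coincide.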

The errors, the estimation gains, and the number of iterations required for the
convergence of the SVD and the weighted average estimations for different values
of the error weight $\omega$ are summarized in Table 3.1. In all the cases, the estimation
algorithm was stopped when the weighted error $e_t$ specified in (3.47) changed by less
than $10^{-6}$.

Table 3.1 Comparison between the underlying pitch pulse estimation
using the SVD and the weighted average for different values of the error
weight $\omega$.

           SVD                                                     Weighted Average
$\omega$   $\sum\sigma_k^2$  $\sum\alpha_k^2$  $G_e$ (dB)  Iter.   $\sum\sigma_k^2$  $\sum\alpha_k^2$  $G_e$ (dB)  Iter.
0.1        6.54              0.00              0.14        2       4.89              0.11              1.30        3
0.2        5.75              0.03              0.67        2       3.61              0.37              2.30        4
0.3        4.37              0.19              1.70        2       2.69              0.79              2.99        5
0.4        2.77              0.67              2.93        4       2.00              1.10              3.38        5
0.5        1.48              1.53              3.51        6       1.47              1.54              3.51        6
0.6        0.70              2.68              3.01        7       1.05              2.04              3.39        7
0.7        0.36              3.82              2.08        6       0.72              2.64              3.03        10
0.8        0.23              4.78              1.30        7       0.44              3.41              2.44        14
0.9        0.13              5.85              0.53        12      0.21              4.58              1.49        24


Fig. 3.12 Comparison between the SVD and the weighted average estimation
for different values of the error weight $\omega$. (a) The underlying
pitch pulses estimated using the SVD method ($\mathbf{v}_{SVD}$). (b) The underlying
pulses estimated using the weighted average method ($\mathbf{v}_{WA}$). (c) The
difference between $\mathbf{v}_{SVD}$ and $\mathbf{v}_{WA}$.

When $\omega$ is small, the underlying pulses obtained with the weighted average estimation
are not as close to the noisy pulses as the underlying pulses obtained with the
SVD estimation (the error energy $\sum_k \alpha_k^2$ is larger for the weighted average estimation).
When $\omega$ is large, the weighted average estimation requires more iterations and
does not produce the underlying pulses as smoothly evolving as the SVD estimation
(the error energy $\sum_k \sigma_k^2$ is smaller for the SVD estimation). The weighted average
estimation works effectively, however, when the error weight $\omega$ is near 0.5.

3.4 Summary 

In this chapter we have formulated the principles of the PPE method. Based on 
the adopted speech production model, the LP excitation is modelled as a series of 
underlying pitch pulses buried in noise. The noisy pitch pulses are extracted from 
the LP residual and the underlying pitch pulses are estimated based on the extracted 
pulses. 

The transmitter predicts the current underlying pulse based on the past coded 
LP excitation, estimates the current underlying pulse based on the current LP resid- 
ual, and transmits the difference. The information about the pitch pulse positions 
and the unvoiced component of the LP excitation is also transmitted. The receiver 
predicts the current underlying pulse in the same way as the transmitter does, and 
based on the transmitted information about the pulse evolution, refines this predic- 
tion. The LP excitation is formed by placing the created pulses at the coded pulse 
positions and adding the unvoiced component. 

We have outlined the requirements imposed on the pitch pulse extraction algo- 
rithm. The LP excitation is regarded as a series of consecutive pitch pulses so that 
the concatenation of the extracted pitch pulses should form the original LP residual. 
The straightforward formulation of the problem of the pitch pulse extraction leads to 
a computationally very expensive system; we have indicated directions for reducing 
the computational burden to implementable levels. 

The pitch pulse extraction is one of the most important parts of the PPE system,
on which the rest of the coder performance is dependent. The extraction procedure
is especially designed to fit the adopted speech production model and it performs a
number of operations which are performed as distinct steps in other coders (e.g., the
WI coder). The PPE pitch pulse extraction combines (i) pitch period estimation,
(ii) pitch pulse extraction, and (iii) alignment of the extracted pitch pulses. The
formulation of the pitch pulse extraction further includes simple estimation of the
underlying pitch pulse which corresponds to the SEW/REW separation used in the WI
coding. Since the pulse extraction is of central importance, it has been given particular
attention in our work. A practical implementation of the pitch pulse extractor is
presented in Section 5.3.

Various methods for estimating the underlying pitch pulses have been presented. 
In particular we have examined linear filtering with fixed coefficients, maximum ratio 
combining, noise error minimization and total error minimization. The last technique 
provides the most flexible framework in which the properties of the calculated pitch 
pulses can be directly influenced by a single parameter, the error weight. 

We have developed an iterative algorithm as a practical way of finding a solution to 
the non-linear optimization problem which results from the total error minimization 
approach. The performance of the algorithm has been illustrated and the SVD and 
the weighted average versions of the algorithm have been compared. We have shown 
that the weighted average estimation, while computationally much less expensive, 
produces similar results to the SVD estimation.

Chapter 4

Interpolation of the Pitch Pulses 

Interpolation of the pitch involves creating intermediate pitch pulses between given 
points in the waveform. A pitch pulse is characterized by (i) the pitch pulse length, 
and (ii) the waveshape which forms the pulse. Pitch interpolation techniques include 
those used in Sinusoidal Transform Coding (STC) (McAulay et al. 1991, Brand- 
stein et al. 1991, McAulay and Quatieri 1995), Waveform Interpolation (WI) coding 
(Kleijn and Granzow 1991, Kleijn 1993, Kleijn and Haagen 1995b), and Relaxed- 
CELP (RCELP) (Kleijn et al. 1994). In all three methods (STC, WI and RCELP), 
the interpolation of the pulse length and the interpolation of the pulse waveshape are 
part of one interpolation procedure. 

In the PPE model we decouple the interpolation of the pitch pulse length and the 
interpolation of the pulse waveshape to gain a greater control over the evolution of 
the pulse characteristics. We argue that such an approach is justifiable from the point 
of view of the speech production model. Our experiments and observations of the 
LP residual indicate that indeed the waveshape of a pitch pulse is largely independent 
of the pitch pulse length and, therefore, should be considered separately. 

In this chapter we examine how the interpolation techniques used in other coders 
influence the pitch pulse length and the waveshapes of the interpolated pulses. We 
describe the PPE interpolation of the pitch pulse lengths in which the waveshapes of 
the pitch pulses are not changed. Finally, the spectral interpolation used in STC and WI
is presented. We adopt the spectral interpolation to modify the waveshapes of the
pitch pulses without changing the interpolated pitch pulse lengths.

4.1 Pitch Pulse-Length Interpolation


4.1.1 Periodic and Quasi-Periodic Signals 

A periodic signal is composed of a number of waveforms which have the same length
and identical shapes. The period of the signal $p(t)$ is constant and is equal to the
length of one waveform. The phase of the signal $\phi(t)$ increases linearly in time and
changes by $2\pi$ within one period.

A quasi-periodic signal is composed of a number of waveforms which may not have
the same length and which are only alike in shapes. We can describe the periodicity
of a quasi-periodic signal using the phase of the signal, $\phi(t)$. We define the following:

(i) the instantaneous period $p(t)$ which reflects the local variations of the phase,

$$ \frac{1}{p(t)} = \frac{1}{2\pi} \frac{d\phi(t)}{dt} , \tag{4.1} $$

(ii) the interval $P(t)$ which is the time period over which, starting at $t$, the phase
increases by $2\pi$,

$$ P(t) = \min_{\phi(\tau) - \phi(t) = 2\pi} \tau - t . \tag{4.2} $$

Given the initial phase $\phi(t_0)$, the instantaneous period $p(t)$ determines the phase $\phi(t)$
as

$$ \phi(t) = \phi(t_0) + 2\pi \int_{t_0}^{t} \frac{1}{p(\tau)} \, d\tau . \tag{4.3} $$

The interval $P(t)$ can be calculated from the instantaneous period $p(t)$ by solving the
equation:

$$ \int_{t}^{t + P(t)} \frac{1}{p(\tau)} \, d\tau = 1 . \tag{4.4} $$

In a quasi-periodic signal the interval $P(t)$ corresponds to the time which separates
two consecutive similar-shape features of the concatenated waveforms. For a periodic
signal the interval $P(t)$ and the instantaneous period $p(t)$ are equivalent and they are
equal to the constant period of the signal. For a quasi-periodic signal the interval
$P(t)$ and the instantaneous period $p(t)$ are, in general, different.

The LP Residual

For voiced speech, the quasi-periodic LP residual is composed of a series of pitch
pulses of varying lengths and shapes. We mark the beginnings of two adjacent pitch
pulses as $t_i$ and $t_{i+1}$. We specify the time interval $P(t_i)$ as

$$ P(t_i) = t_{i+1} - t_i \tag{4.5} $$

so that the interval $P(t_i)$ is equal to the length of the pitch pulse beginning at time
$t_i$. Pitch interpolation results in the change of the pitch pulse positions $t_i$ and $t_{i+1}$
into $\tilde{t}_i$ and $\tilde{t}_{i+1}$, respectively. The pitch-interpolated pulse begins at time $\tilde{t}_i$ and is of
length $P_I(t_i)$,

$$ P_I(t_i) = \tilde{t}_{i+1} - \tilde{t}_i . \tag{4.6} $$

Various types of pitch interpolations transform the waveshape within the interval
$P(t_i)$ into the interval $P_I(t_i)$ in different ways. We examine the pitch interpolations
in the light of this transformation.
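
As a worked illustration of the difference between the interval $P(t)$ and the instantaneous period $p(t)$, consider an assumed example that is not taken from the text: a period that changes linearly, $p(t) = p_0 + ct$, with $c \neq 0$ and $p(t) > 0$ over the interval considered. Solving the integral condition (4.4) for this model gives:

```latex
\int_{t}^{t+P(t)} \frac{d\tau}{p_0 + c\tau}
  = \frac{1}{c} \ln \frac{p(t) + c\,P(t)}{p(t)} = 1
\quad \Longrightarrow \quad
P(t) = p(t) \, \frac{e^{c} - 1}{c} .
```

As $c \to 0$ the factor $(e^{c}-1)/c$ tends to 1 and $P(t) = p(t)$, which is the periodic case. For $c = 0.1$ the factor is about 1.052, so the interval is roughly 5% longer than the instantaneous period at its start.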

4.1.2 Pitch Interpolation in Existing Coders 
Waveform Interpolation Coder 

In the context of WI the pitch period is viewed as the instantaneous period $p(t)$.
The period $p(t)$ is estimated at regular time intervals, coded and transmitted. The
interpolated instantaneous period $p_I(t)$ is created by linear interpolation of the
transmitted values of $p(t)$. The LP excitation is formed based on the interpolated period
$p_I(t)$ which determines, given the initial phase $\phi_I(t_0)$, the evolution of $\phi_I(t)$.

Consider a pitch pulse which starts at time $t_i$ and whose length is $P(t_i)$. In WI
the instantaneous pitch period $p(t)$ is estimated without explicit concern with the
evolution of $\phi(t)$. In particular, even if the instantaneous period $p(t)$ is estimated for
every time $t$, the phase $\phi(t)$ determined from the estimated $p(t)$ may not change by
$2\pi$ within the time interval $P(t)$.

The instantaneous period $p_I(t)$ is formed by interpolating the transmitted values
of $p(t)$. The interpolated period $p_I(t)$ determines the phase $\phi_I(t)$, but the employed
interpolation of $p_I(t)$ (linear interpolation of the coded $p(t_n)$) provides no explicit
constraint on the evolution of $\phi_I(t)$ (which determines the pitch-interpolated pulse
length $P_I(t_i)$). The difference between $P_I(t_i)$ and $P(t_i)$ results in the loss of time
synchrony between the original and the reconstructed signal and it may accumulate
over a number of pulses. Eventually, addition or deletion of pitch pulses may occur.

In WI, pitch pulse waveshapes are length (phase) normalized, coded and transmitted.
The signal is reconstructed by applying the interpolated phase $\phi_I(t)$ to the
phase-normalized waveshapes. The evolution of the phase $\phi_I(t)$ is continuous and in effect
the original pitch pulse of length $P(t_i)$ is time-warped to fit the pitch-interpolated
pulse length $P_I(t_i)$. The time warping is determined by the implicit evolution of the
phase $\phi_I(t)$.
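
To make the implicit phase evolution concrete, the sketch below integrates a linearly interpolated period into a phase track and reads off where each new pulse would start. It is an illustration only, not the WI coder: the function name `pulse_onsets`, the 8 kHz sampling rate, and the trapezoidal integration are assumptions of the example.

```python
import numpy as np


def pulse_onsets(p_start, p_end, duration, fs=8000.0, phase0=0.0):
    """Pulse onset times implied by a linearly interpolated period.

    p_start, p_end : instantaneous period (seconds) at the two update instants
    duration       : time between the updates (seconds)
    Returns the onset times (phase crossings of multiples of 2*pi) and the phase track.
    """
    n = int(round(duration * fs))
    t = np.arange(n + 1) / fs
    p = p_start + (p_end - p_start) * t / duration       # interpolated period
    # Phase is 2*pi times the running integral of 1/p (trapezoidal rule).
    inc = 2.0 * np.pi * 0.5 * (1.0 / p[1:] + 1.0 / p[:-1]) / fs
    phase = phase0 + np.concatenate(([0.0], np.cumsum(inc)))
    cycles = np.floor(phase / (2.0 * np.pi))
    idx = np.nonzero(np.diff(cycles) > 0)[0] + 1          # first sample of each new cycle
    return t[idx], phase
```

The spacing of the returned onsets is the pulse length implied by the interpolation; nothing in the computation ties it to the pulse lengths of the original residual, which is the source of the drift described above.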


Sinusoidal Transform Coding 

In STC, a set of frequencies $f_k(t)$ and their phases $\phi_k(t)$ are identified. In general, the
set $\{f_k(t)\}$ may include any frequencies but, in low bit rate coders, it often consists
of the harmonics of the fundamental frequency $f(t)$. The fundamental frequency $f(t)$
is the inverse of the instantaneous period $p(t)$,

$$ f(t) = \frac{1}{p(t)} . \tag{4.7} $$

The frequencies $f_k(t)$ and the phases $\phi_k(t)$ are interrelated,

$$ f_k(t) = \frac{1}{2\pi} \frac{d\phi_k(t)}{dt} \tag{4.8} $$

and

$$ \phi_k(t) = \phi_k(t_0) + 2\pi \int_{t_0}^{t} f_k(\tau) \, d\tau . \tag{4.9} $$

The frequencies $f_k(t)$ corresponding to each analysis frame are estimated, coded
and transmitted. The interpolation of the coded frequencies $f_k(t_n)$ is quadratic. Since
there are many possible quadratic paths which the interpolated frequencies $f_{k,I}(t)$ can
follow between the coded updates, the initial phase $\phi_{k,I}(t_0)$ and the frequencies
corresponding to consecutive frames $f_k(t_n)$ are not sufficient to determine the interpolation.
By specifying an extra condition for every coded frequency, the piece-wise quadratic
evolutions of $f_{k,I}(t)$ can be chosen such that the cubic evolutions of the phases $\phi_{k,I}(t)$
satisfy the imposed boundary conditions. Given the initial phases $\phi_{k,I}(t_0)$, the phases
$\phi_{k,I}(t)$ determine the reconstructed signal. By imposing the boundary conditions
on $\phi_{k,I}(t)$, the time synchrony between the original and the reconstructed signals is
maintained. The encoding of the information about the boundary condition, however,
increases the coder bit rate.

In STC the signal is reconstructed by the summation of the waveforms obtained
by applying the interpolated phases $\phi_{k,I}(t)$ to a set of sinusoids. Although the time
synchrony between the original and the reconstructed signals is retained, the lengths
of individual pitch pulses before the interpolation $P(t_i)$ and after the interpolation
$P_I(t_i)$ may differ. During the reconstruction a pitch pulse of length $P(t_i)$ is in effect
time-warped to fit the interpolated pulse length $P_I(t_i)$. The warping is determined
by the evolution of the phases $\phi_{k,I}(t)$.
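
One common way of meeting such boundary conditions is the cubic phase interpolation usually associated with McAulay and Quatieri; the sketch below implements that standard construction and may differ in detail from the STC variants discussed here. The function names are hypothetical, frequencies are given in Hz, and the "extra condition" is taken to be the measured phase at the end of the frame.

```python
import numpy as np


def cubic_phase_coeffs(theta0, f0, theta1, f1, T):
    """Coefficients of phase(t) = theta0 + 2*pi*f0*t + a*t^2 + b*t^3 on [0, T].

    The track matches phase theta1 (modulo 2*pi) and frequency f1 at t = T.
    M is the number of extra full turns, chosen for the smoothest track.
    """
    w0, w1 = 2.0 * np.pi * f0, 2.0 * np.pi * f1
    M = np.round(((theta0 + w0 * T - theta1) + (w1 - w0) * T / 2.0) / (2.0 * np.pi))
    d = theta1 + 2.0 * np.pi * M - theta0 - w0 * T   # phase still to be covered
    a = 3.0 * d / T**2 - (w1 - w0) / T
    b = -2.0 * d / T**3 + (w1 - w0) / T**2
    return a, b, M


def cubic_phase(t, theta0, f0, a, b):
    """Evaluate the interpolated phase; its derivative is a quadratic frequency track."""
    return theta0 + 2.0 * np.pi * f0 * t + a * t**2 + b * t**3
```

The end phase `theta1` is the additional quantity that has to be known at the receiver, which illustrates why imposing the boundary condition costs bit rate.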

Relaxed-CELP

In a CELP coder with an adaptive codebook, the LP excitation is formed as a sum of 
the adaptive codebook entry and a fixed codebook entry. First the adaptive codebook 
is created from the past LP excitation. Then the adaptive codebook contribution and 
the fixed codebook contribution are selected based on a weighted error between the 
original and the reconstructed speech. In a generalized analysis-by-synthesis proce- 
dure, the original speech is time-scale modified and the weighted error is calculated 
with respect to the modified signal. 

The Relaxed-CELP (RCELP) coders use the generalized analysis-by-synthesis approach.
For every frame, the RCELP coder estimates and encodes the instantaneous
period $p(t_n)$. The interpolated period $p_I(t)$ is formed by linear interpolation of the
transmitted $p(t_n)$. The adaptive codebook contribution is formed, at the transmitter
and at the receiver, by time warping the past LP excitation. The time warping is such
that the instantaneous period of the created signal matches, in the current frame, the
interpolated period $p_I(t)$. The modified speech signal used in the generalized
analysis-by-synthesis procedure is obtained from the time-scale modified LP residual. In the
time-scale modifications, blocks of the original LP residual are time shifted.

The adaptive codebook contribution is used as the reference vector with respect to
which the time-shifts of the LP residual are optimized (the minimized perceptually-
weighted error is calculated in the speech domain). In effect the time warping is used
to create the reference vector for the shifting procedure but is not used to actually
modify the LP residual. By limiting the maximum allowable accumulated shift, the
time synchrony between the original and the modified signals is maintained.

The fixed codebook contribution is chosen based on the weighted error between the
reconstructed speech and the speech obtained from the modified LP residual. There
is an inconsistency in this procedure: the adaptive codebook contribution is formed
by time warping and the fixed-codebook target vector is based on the LP residual
modified with time shifting.† For best results (i.e., for a best match between the
adaptive codebook pulses and the modified LP residual pulses) both signals should
be created with the same procedure: either time warping or time shifting.

In the description of RCELP, the use of time shifting is presented as a computa- 
tional saving over more computationally expensive time warping. It is implied that, 
computational complexity aside, time warping is the preferable method of forming 
the modified LP residual. It was reported that a system with the time-warped LP 
residual had a significant increase in coding efficiency (Kleijn et al. 1993). Also time 
shifting could be applied to both, the LP residual used to form the modified speech 
and the past LP excitation used to create the adaptive codebook. We do not know 
of such a system having been tested. 

4.1.3 Is Time Warping Justified? 

Time warping results in “continuous evolution” of the waveforms but it changes the 
“internal structure” of a pulse by time stretching or contracting the pulse. Is it the 
right thing to do? 

We have carried out a number of experiments to see if time warping is justifiable
based on what happens to the original LP residual waveform in the case of rapidly
changing pitch. We recorded a vowel “a” spoken with a quickly rising pitch. We then
examined the behaviour of consecutive pitch pulses and tried to determine the relation
between waveshapes of pulses with different lengths. Given two pulses, occurring in
the same voiced region but one being considerably shorter than the other, we wanted
to determine whether the longer pulse could be formed by “stretching” the shorter pulse,
in which case one pulse would be a time-warped version of the other, or whether the
longer pulse could be formed by letting the shorter pulse “ring” longer. We did not try
to predict this “ringing” and we simply padded the shorter pulse with zeros to the
length of the longer pulse.

† In the context of a time window which contains few pitch pulses, time shifting can be viewed as
a form of a discretized time warping. We are interested in what is happening within one pitch pulse,
within the time interval $P(t_i)$, and in this context time warping and time shifting are distinct.

Our observations show that a longer pulse can, in fact, be much better approximated
by the “ringing” version of the shorter pulse and not by its “stretched” version.
Fig. 4.1 depicts a series of pitch pulses extracted from the LP residual of the recorded
vowel “a”. Between the pulses plotted in Fig. 4.1a and the pulses plotted in Fig. 4.1b,
the LP residual contained only four extra pitch pulses (not shown in the figure). The
pulses of Fig. 4.1a and Fig. 4.1c are identical. The pulses of Fig. 4.1d are created
by padding the pulses of Fig. 4.1b with zeros. The pulses of Fig. 4.1e are formed
by time warping (which in this case is equivalent to time stretching) the pulses of
Fig. 4.1b. We can observe that the pulses padded with zeros (Fig. 4.1d) maintain the
alignment with the longer pulses (Fig. 4.1c) and the stretched pulses (Fig. 4.1e) lose
this alignment. Based on our experiments, we believe that time warping of the LP
residual while changing (interpolating) the pitch is not appropriate. In our
pitch interpolation we allow only time shifting of the pitch pulses, effectively changing
the length of the pulses but avoiding any time warping within a pulse. Time shifting
may introduce signal discontinuities at time-shift boundaries. This, however, is not
perceptually noticeable if the discontinuity occurs in a region of low energy, which is
the case when the pitch pulses are properly extracted.
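
The comparison between padding and stretching can be reproduced in outline with the helpers below. This is a sketch under stated assumptions: the function names are hypothetical, stretching is done by linear interpolation with `numpy.interp`, and similarity is measured by normalized correlation, none of which is specified in the text.

```python
import numpy as np


def pad_pulse(pulse, length):
    """Extend a pulse to `length` samples by appending zeros (the "ringing" stand-in)."""
    out = np.zeros(length)
    out[: len(pulse)] = pulse
    return out


def stretch_pulse(pulse, length):
    """Time-warp a pulse to `length` samples by linear interpolation."""
    x_old = np.linspace(0.0, 1.0, len(pulse))
    x_new = np.linspace(0.0, 1.0, length)
    return np.interp(x_new, x_old, pulse)


def normalized_correlation(a, b):
    """Cosine similarity between two equal-length waveforms."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Applied to a pair of extracted pulses, the two correlations against the longer pulse indicate which modification preserves the alignment better. A synthetic decaying oscillation will favour padding by construction, so only measurements on real residual pulses bear on the conclusion above.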

4.1.4 Pitch Pulse-Length Interpolation in the PPE Model 

In the PPE model we code the average pitch and not the local pitch associated with 
one time instant. We code the length of a few pitch pulses and our interpolation can 
be viewed as a segmentation of the encoded length into the correct number of pulses. 
In this way we preserve the pitch “on average” and maintain the time synchrony of 
the reconstructed speech with the original. 

The pitch information is coded once per frame. For frame $k$, we encode the 
position of a pitch pulse $\tau_k$ and the information about the number of pulses between 
$\tau_k$ and the pulse position coded in the previous frame, $\tau_{k-1}$. We write the number of 
pulses between $\tau_{k-1}$ and $\tau_k$ as $N_k$. In general, there is more than one pitch pulse in 
a frame. The problem of selecting the pulse whose position will be coded is deferred 
to Section 5.4.1. 

The reconstructed signal is time-synchronous with the original at least once per 
frame, at the time instant $\tau_k$. The average pitch pulse length is preserved although 
the intermediate pitch pulse lengths may be different from those in the original signal. 





[Fig. 4.1 Pitch pulses extracted from the LP residual; panels (a) to (e), with (e) showing the pulses of (b) stretched.] 
We refer to the segmentation of a block of $\tau_k - \tau_{k-1}$ samples into $N_k$ pulses as the 
pitch pulse-length interpolation. In pulse-length interpolation each segment corresponds 
to an interpolated pitch pulse length $P_I(t_i)$. The segmentation is determined 
by an interpolation specified in one of the following domains: the interval $P(t)$, the 
fundamental frequency $f(t)$, or the instantaneous period $p(t)$. Whatever the domain, 
the interpolated pitch pulse length $P_I(t_i)$ is calculated and the signal is reconstructed 
based on the relation between $P(t_i)$ and $P_I(t_i)$. The pitch pulse of length $P_I(t_i)$ is 
formed from the corresponding pitch pulse of length $P(t_i)$ either by truncating the 
original pulse (if $P(t_i)$ is longer) or by adding to the original pulse a suitable extension 
(if $P(t_i)$ is shorter). There is no time warping. 

The description of the interpolations in the different domains follows. 


Interpolation in the Pitch Pulse Length Domain P(t) 

The encoded position of frame $k-1$ marks the beginning of the first pulse of the region 
coded in frame $k$, 

\[ t_0 = \tau_{k-1} , \tag{4.10} \]

and the coded position of frame $k$ marks the end of the last pulse in the region, 

\[ t_{N_k} = \tau_k . \tag{4.11} \]

We require that 

\[ \sum_{j=0}^{N_k-1} P_I(t_j) = t_{N_k} - t_0 , \tag{4.12} \]

where $P_I(t_0), \ldots, P_I(t_{N_k-1})$ are the pitch pulse lengths interpolated in frame $k$. 
Linear interpolation in the $P(t)$ domain is performed as: 

\[ P_I(t_j) = P_I(t_{j-1}) + d_k , \qquad t_{j+1} = t_j + P_I(t_j) , \qquad 0 \le j < N_k , \tag{4.13} \]

where $P_I(t_{-1})$ is the last pitch pulse length interpolated in the previous frame. We 
call this interpolation linear because the difference between consecutive pulse lengths 
within a frame is constant, 

\[ d_k = P_I(t_j) - P_I(t_{j-1}) = \text{const} , \qquad 0 \le j < N_k . \tag{4.14} \]

The difference between the pulse lengths is calculated as 

\[ d_k = 2\, \frac{(\tau_k - \tau_{k-1}) - N_k P_I(t_{-1})}{N_k (N_k + 1)} . \tag{4.15} \]


In the PPE coder described in Chapter 5, we use pulse-length interpolation based on 
this linear interpolation in P(t). 
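
As a worked illustration of (4.12) to (4.15), the following minimal Python sketch segments a block into linearly changing pulse lengths. The function name `linear_pulse_lengths` and its argument names are hypothetical and are not taken from the coder. The sketch also leaves the lengths fractional, whereas an actual coder would have to map them onto its sample grid.

```python
def linear_pulse_lengths(tau_prev, tau_cur, n_pulses, last_len):
    """Split the block tau_cur - tau_prev into n_pulses linearly changing lengths.

    last_len plays the role of P_I(t_{-1}), the last pulse length
    interpolated in the previous frame (hypothetical argument name).
    """
    if n_pulses < 1:
        raise ValueError("need at least one pulse in the block")
    block = tau_cur - tau_prev
    # Constant difference between consecutive pulse lengths, as in (4.15).
    d_k = 2.0 * (block - n_pulses * last_len) / (n_pulses * (n_pulses + 1))
    # P_I(t_j) = P_I(t_{-1}) + (j + 1) * d_k for j = 0 .. n_pulses - 1.
    lengths = [last_len + (j + 1) * d_k for j in range(n_pulses)]
    return d_k, lengths


d_k, lengths = linear_pulse_lengths(tau_prev=0.0, tau_cur=160.0,
                                    n_pulses=3, last_len=50.0)
# The lengths add up to the coded block length, which is requirement (4.12).
assert abs(sum(lengths) - 160.0) < 1e-9
```

Because only the block length and the number of pulses enter the calculation, the average pulse length is preserved by construction even when the individual lengths differ from the original ones.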


Interpolation in the Fundamental Frequency Domain f(t) 

For the interpolation in the $f(t)$ domain, we calculate the interpolated pitch pulse 
lengths $P_I(t_i)$ from the interpolated fundamental frequency $f_I(t)$. 

For the frame $k$, we have 

\[ \int_{\tau_{k-1}}^{\tau_k} f_I(t)\, dt = N_k . \tag{4.16} \]

The coded time intervals are written as 

\[ T_{k-1} = \tau_{k-1} - \tau_{k-2} , \tag{4.17} \]

\[ T_k = \tau_k - \tau_{k-1} , \tag{4.18} \]

\[ T_{k+1} = \tau_{k+1} - \tau_k . \tag{4.19} \]


Linear Interpolation 

The linearly changing frequency $f_I(t)$ can be specified in the frame $k$ as 

\[ f_I(t) = a_k (t - \tau_{k-1}) + b_k , \qquad \tau_{k-1} \le t < \tau_k . \tag{4.20} \]

The following approaches can be used to calculate the parameters $a_k$ and $b_k$. 

(i) Applying (4.16) for the previous and the current frame, we obtain a set of linear 
equations: 

\[ \begin{cases} a_k T_{k-1}^2 + b_k T_{k-1} = N_{k-1} \\ a_k T_k^2 + b_k T_k = N_k , \end{cases} \tag{4.21} \]

which we solve for $a_k$ and $b_k$. In this formulation $a_k$ and $b_k$ do not depend on the 
$a$'s and $b$'s calculated for other frames. The resulting function $f_I(t)$ is piecewise 
continuous with jumps at the coded positions $\{\tau_k\}$. 

(ii) To make the function $f_I(t)$ continuous, the parameter $b_k$ can be calculated from 
the values $a_{k-1}$ and $b_{k-1}$ used in the previous frame, 

\[ b_k = a_{k-1} T_{k-1} + b_{k-1} . \tag{4.22} \]

We can obtain $a_k$ as 

\[ a_k = \frac{N_k - b_k T_k}{T_k^2} . \tag{4.23} \]

Quadratic Interpolation 

Assuming that within a frame $f_I(t)$ changes quadratically we have 

\[ f_I(t) = a_k (t - \tau_{k-1})^2 + b_k (t - \tau_{k-1}) + c_k , \qquad \tau_{k-1} \le t < \tau_k . \tag{4.24} \]

To calculate the parameters $a_k$, $b_k$ and $c_k$ we can use one of the following methods. 

(i) We solve the set of linear equations 

\[ \begin{cases} a_k T_{k-1}^3 + b_k T_{k-1}^2 + c_k T_{k-1} = N_{k-1} \\ a_k T_k^3 + b_k T_k^2 + c_k T_k = N_k \\ a_k T_{k+1}^3 + b_k T_{k+1}^2 + c_k T_{k+1} = N_{k+1} . \end{cases} \tag{4.25} \]

This interpolation will result in a piecewise continuous $f_I(t)$ with discontinuities 
at the coded positions $\{\tau_k\}$. 

(ii) To make $f_I(t)$ continuous, we set 

\[ c_k = a_{k-1} T_{k-1}^2 + b_{k-1} T_{k-1} + c_{k-1} , \tag{4.26} \]

and solve for $a_k$ and $b_k$ in 

\[ \begin{cases} a_k T_k^3 + b_k T_k^2 = N_k - c_k T_k \\ a_k T_{k+1}^3 + b_k T_{k+1}^2 = N_{k+1} - c_k T_{k+1} . \end{cases} \tag{4.27} \]
(iii) Using the description of the future pulse lengths to determine the lengths of 
the pulses in the current frame increases the coding delay. We can calculate the 
interpolation coefficients without the delay if we require that $f_I(t)$ is not only 
continuous but also smooth (the first derivative of $f_I(t)$ is continuous). This 
leads to the set of linear equations: 

\[ \begin{cases} c_k = a_{k-1} T_{k-1}^2 + b_{k-1} T_{k-1} + c_{k-1} \\ 2 a_k T_{k-1} + b_k = 2 a_{k-1} T_{k-1} + b_{k-1} \\ a_k T_k^3 + b_k T_k^2 = N_k - c_k T_k , \end{cases} \tag{4.28} \]

which we solve for $a_k$, $b_k$ and $c_k$. 

(iv) With extra bits of information we could specify a different boundary condition, 
based on which we could calculate the quadratic interpolation parameters of 
$f_I(t)$. This approach is used in STC. 

Calculating the Pitch Pulse Lengths 

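Once the $f_I(t)$ interpolation parameters are determined, we calculate the interpolated 
pitch pulse lengths $P_I(t_j)$. With $t_0 = \tau_{k-1}$, 

\[ \int_{t_j}^{t_{j+1}} f_I(t)\, dt = 1 , \qquad P_I(t_j) = t_{j+1} - t_j , \qquad \text{for } 0 \le j < N_k . \tag{4.29} \]

Note that, based on (4.16), we have $t_{N_k} = \tau_k$. 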
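
For the linear case (4.20), the unit-area condition of (4.29) can be solved in closed form, since the running integral of a linear function is a quadratic. The Python sketch below illustrates this under stated assumptions: the function name `boundaries_linear_frequency` is hypothetical, the frequency is assumed to stay positive over the frame, and the example values of `a_k` and `b_k` are simply chosen so that the exact integral over a 160-sample frame equals three pulses.

```python
import math


def boundaries_linear_frequency(a_k, b_k, n_pulses):
    """Pulse boundaries s_j = t_j - tau_{k-1} for f_I(s) = a_k * s + b_k.

    Boundary s_j is where the running integral of f_I reaches j, i.e.
    a_k * s**2 / 2 + b_k * s = j. Assumes f_I > 0 over the frame.
    """
    bounds = []
    for j in range(n_pulses + 1):
        if abs(a_k) < 1e-15:
            s = j / b_k  # constant frequency: equally long pulses
        else:
            # First root of a_k / 2 * s^2 + b_k * s - j = 0 reached from s = 0.
            s = (-b_k + math.sqrt(b_k * b_k + 2.0 * a_k * j)) / a_k
        bounds.append(s)
    lengths = [bounds[j + 1] - bounds[j] for j in range(n_pulses)]
    return bounds, lengths


# Example values (assumed): rising frequency, three pulses in 160 samples.
bounds, lengths = boundaries_linear_frequency(a_k=4.6875e-5, b_k=0.015, n_pulses=3)
assert abs(bounds[-1] - 160.0) < 1e-6   # t_{N_k} coincides with tau_k
```

With a rising frequency the pulse lengths shrink from one pulse to the next, while the last boundary still lands on the coded position.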


Interpolation in the Instantaneous Period Domain p(t) 

The interpolated instantaneous period $p_I(t)$ should satisfy 

\[ \int_{\tau_{k-1}}^{\tau_k} \frac{1}{p_I(t)}\, dt = N_k . \tag{4.30} \]

In the linear interpolation of $p(t)$, 

\[ p_I(t) = a_k (t - \tau_{k-1}) + b_k , \qquad \tau_{k-1} \le t < \tau_k . \tag{4.31} \]

We can calculate $a_k$ and $b_k$ in one of the following ways. 

(i) With $T_{k-1}$ and $T_k$ as specified in (4.17) and (4.18), we solve for $a_k$ and $b_k$ in the 
set of non-linear equations 

\[ \begin{cases} b_k e^{a_k N_{k-1}} = a_k T_{k-1} + 1 \\ b_k e^{a_k N_k} = a_k T_k + 1 . \end{cases} \tag{4.32} \]

(ii) We ensure the continuity of $p_I(t)$ by specifying the conditions as 

\[ \begin{cases} b_k = a_{k-1} T_{k-1} + b_{k-1} \\ b_k e^{a_k N_k} = a_k T_k + 1 . \end{cases} \tag{4.33} \]

Even in linear interpolation, the calculation of the pitch pulse lengths from the 
instantaneous period $p_I(t)$ leads to a set of non-linear equations. Linear interpolation 
in $p(t)$ is used in WI but the pulse lengths are not calculated there. 

Calculating the Pitch Pulse Lengths 

With $t_0 = \tau_{k-1}$, the interpolated pitch pulse lengths are obtained with: 

\[ \int_{t_j}^{t_{j+1}} \frac{1}{p_I(t)}\, dt = 1 , \qquad P_I(t_j) = t_{j+1} - t_j , \qquad 0 \le j < N_k . \tag{4.34} \]

Based on (4.30) we have $t_{N_k} = \tau_k$. 
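
For a linearly changing period as in (4.31), the unit-area condition of (4.34) can be integrated exactly, which shows that consecutive pulse lengths then form a geometric sequence. The following Python sketch is an illustration only: the function name `boundaries_linear_period` is hypothetical, the period is assumed to stay positive over the frame, and the example values of `a_k` and `b_k` are arbitrary.

```python
import math


def boundaries_linear_period(a_k, b_k, n_pulses):
    """Pulse boundaries s_j = t_j - tau_{k-1} for p_I(s) = a_k * s + b_k.

    The running integral of 1 / p_I is ln((a_k * s + b_k) / b_k) / a_k;
    setting it equal to j gives s_j = b_k * (exp(a_k * j) - 1) / a_k.
    """
    if abs(a_k) < 1e-15:
        bounds = [j * b_k for j in range(n_pulses + 1)]  # constant period
    else:
        bounds = [b_k * math.expm1(a_k * j) / a_k for j in range(n_pulses + 1)]
    lengths = [bounds[j + 1] - bounds[j] for j in range(n_pulses)]
    return bounds, lengths


# Example values (assumed): period growing by 0.1 sample per sample from 50.
bounds, lengths = boundaries_linear_period(a_k=0.1, b_k=50.0, n_pulses=3)
# Each pulse is exp(a_k) times longer than the one before it.
assert abs(lengths[1] / lengths[0] - math.exp(0.1)) < 1e-12
```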

The pitch pulse-length interpolation used in the PPE coder is described in Sec- 
tion 5.4.2 and Appendix A. 


4.2 Pitch Pulse-Shape Interpolation 

4.2.1 Spectral Interpolation 

Spectral interpolation refers to the idea of the reconstruction of a two-dimensional 
time-spectrum signal $X_I(t, \omega)$ from an ensemble of spectra $X(t_1, \omega), X(t_2, \omega), \ldots$. 
Given a one-dimensional time signal $y(t)$, first a two-dimensional signal $x(t, \tau)$ is 
formed and then a two-dimensional time-frequency signal $X(t, \omega)$ is generated by 
applying to $x(t, \tau)$ a transform function with respect to $\tau$. Sampling of $X(t, \omega)$ results 
in a set of spectra $X(t_1, \omega), X(t_2, \omega), \ldots$. The signal $X_I(t, \omega)$ is formed by performing 
spectral interpolation on the ensemble $X(t_1, \omega), X(t_2, \omega), \ldots$. A new signal $x_I(t, \tau)$ 
is constructed through an inverse transform of the interpolated $X_I(t, \omega)$. Finally a 
one-dimensional time signal $y_I(t)$, akin to the original $y(t)$, is obtained. 

This idea has been applied initially, in the context of speech analysis, to the prob- 
lem of time-scale modification in which the rate of the speech is changed without 
change of the perceived pitch. Portnoff formulated a mathematical representation 
of speech and described, based on that representation, a mapping between the one- 
dimensional speech signal and a two-dimensional time-frequency signal through the 
Short-Time Fourier Transform (STFT) (Portnoff 1981a, b). The objective was that 
“temporal features” of the speech appear as functions of the time variable and 
“spectral features” appear as functions of the frequency variable. The desired rate 
change of the speech signal was achieved by time-decimation/interpolation of the 
spectral representation parameters. Signal estimation from the modified, i.e., inter- 
polated, STFT was studied in (Griffin and Lim 1984) and led to an improvement and 
simplification of the Portnoff system. 

At present, speech analysis based on sinusoidal representation and spectral inter- 
polation are used extensively. Harmonic coding (Almeida and Tribolet 1982, Marques 
et al. 1990), Sinusoidal Transform Coding (McAulay and Quatieri 1986, McAulay 
et al. 1991, McAulay and Quatieri 1995), and Multi-Band Excitation (Hardwick and 
Lim 1988, 1989, Brandstein et al. 1990) all use interpolation in the frequency domain. 
Spectral interpolation has been applied to coding the quasi-periodic LP excitation of 
voiced speech in the PWI method (Kleijn 1991, Kleijn and Granzow 1991, Kleijn 
1993). As mentioned in Chapter 2 the LP model assumes separability between the 
excitation and the formant structure modelled by the LP filter. Separate analysis 
and interpolation are therefore carried out on the LP coefficients and on the features 
extracted from the LP residual. Individual pitch pulses obtained from the LP residual 
are aligned, coded infrequently and then interpolated. 

The spectral interpolation is also employed in Time-Frequency Interpolation (TFI) 
(Shoham 1992, 1993b). In TFI, the waveform normalization and alignment are elimi- 
nated. The inverse Fourier transform is modified so that the reconstructed overlapping 
blocks of the LP excitation can be interpolated in the time domain. 




In addition to mere interpolation of the spectra, Kleijn suggested other modifications 
of the two-dimensional signals (Kleijn and Haagen 1994a, b, 1995a). In WI, 
the spectra are filtered with respect to time $t$ to separate their slowly and rapidly 
changing components†. The time signals corresponding to these separated spectra 
are called the slowly evolving waveform (SEW) and the rapidly evolving waveform 
(REW). In the context of speech compression, different coding strategies can be employed 
with respect to the SEW and the REW (e.g., interpolation of the waveforms, 
uneven bit assignment). 

We now describe the spectral interpolation in more detail. 

Spectral Interpolation in WI 

The LP residual $y(t)$ is represented as a two-dimensional signal $u(t, \tau)$ created from 
segments of $y(t)$. The instantaneous pitch period $p(t)$ of the signal $y(t)$ is estimated 
and then $u(t, \tau)$ is formed from the segments of length $p(t)$ extracted from $y(t)$. The 
extracted segments are centered near $t$ and the signal $u(t, \tau)$ is given by 

\[ u(t, \tau) = y(t + \tau) , \qquad \tau_1(t) \le \tau < \tau_2(t) , \tag{4.35} \]

where $\tau_1(t)$ and $\tau_2(t)$ are such that $p(t) = \tau_2(t) - \tau_1(t)$. 

In fact $u(t, \tau)$ is formed only at discrete time instants $t_i$ so that an ensemble 
$\{u(t_i, \tau)\}$ is created. Each of the signals in the set $\{u(t_i, \tau)\}$ is normalized in the $\tau$ 
domain, forming a length-normalized signal 

\[ u(t_i, \phi) = u\!\left(t_i, \frac{p(t_i)}{2\pi}\, \phi\right) . \tag{4.36} \]

The signals in the ensemble $\{u(t_i, \phi)\}$ are periodically extended in $\phi$ and aligned 
for maximum correlation with respect to the previous, already aligned signal of the 
ensemble, 

\[ x(t_i, \phi) = u(t_i, \phi + \psi) \tag{4.37} \]

with 

\[ \psi = \arg\max_{\psi'} \left( \int_{2\pi} u(t_i, \phi + \psi')\, x(t_{i-1}, \phi)\, d\phi \right) . \tag{4.38} \]

In the description of WI the signals in the ensemble $\{x(t_i, \phi)\}$ are called characteristic 
waveforms. 

†For a fixed frequency $\omega$, the two-dimensional signal $X(t, \omega)$ becomes a one-dimensional signal $X_\omega(t)$. In 
the filtering of $X(t, \omega)$ with respect to $t$, the set of the one-dimensional signals $X_\omega(t)$ is filtered. 
The filtered two-dimensional signal corresponding to $X(t, \omega)$ is reconstructed from the filtered 
one-dimensional signals. 
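
The alignment step of (4.37) and (4.38) can be sketched in discrete form as a search over circular shifts. This is a simplified illustration, not the procedure of any particular WI coder: the function name `align_to_previous` is hypothetical, both waveforms are assumed to be already length-normalized and sampled on the same grid, and the shift is restricted to whole samples.

```python
def align_to_previous(u, x_prev):
    """Circularly shift u to maximise its correlation with x_prev.

    Both inputs are lists holding one length-normalised period sampled on
    the same grid; the returned shift is a discrete stand-in for psi.
    """
    n = len(u)
    if n == 0 or n != len(x_prev):
        raise ValueError("waveforms must be non-empty and of equal length")
    best_shift, best_corr = 0, float("-inf")
    for shift in range(n):
        # Correlation of the shifted, periodically extended u with x_prev.
        corr = sum(u[(m + shift) % n] * x_prev[m] for m in range(n))
        if corr > best_corr:
            best_shift, best_corr = shift, corr
    aligned = [u[(m + best_shift) % n] for m in range(n)]
    return best_shift, aligned


shift, aligned = align_to_previous([0.0, 0.0, 0.0, 1.0], [0.0, 1.0, 0.0, 0.0])
assert aligned == [0.0, 1.0, 0.0, 0.0]
```

The exhaustive search costs on the order of n squared operations per waveform, which is acceptable for a sketch. A faster variant could compute the same correlation through the Fourier series coefficients.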

Since $x(t_i, \phi)$ is periodic in $2\pi$ with respect to $\phi$, its Fourier series is given by 

\[ X(t_i, k) = \frac{1}{2\pi} \int_{2\pi} x(t_i, \phi)\, e^{-jk\phi}\, d\phi . \tag{4.39} \]

The number of significant coefficients $M_{t_i}$ is determined by the pitch period $p(t_i)$ and 
the bandwidth of the signal $u(t_i, \tau)$ with respect to $\tau$. Since $M_{t_i}$ depends on $p(t_i)$, 
the calculation of the spectra $X(t_i, k)$ requires the use of the Discrete Fourier Transform 
(DFT) of different lengths, i.e., fast Fourier techniques cannot be used for all $M_{t_i}$. 

The Fourier transforms $X(t_i, k)$ can be viewed as a sampled (decimated) version 
of the time-continuous signal $X(t, k)$. The missing spectra are obtained through an 
interpolation procedure $\mathcal{I}$, 

\[ X_I(t, k) = \mathcal{I}\{X(t_1, k), X(t_2, k), \ldots, X(t_i, k), \ldots\} . \tag{4.40} \]

A multitude of interpolations $\mathcal{I}$ can be specified but only linear interpolations have 
been used. Linearity of the interpolation $\mathcal{I}$ is understood in the sense that 

\[ \mathcal{I}\{X(t_{i-1}, k), X(t_i, k)\} = \alpha(t) X(t_{i-1}, k) + \beta(t) X(t_i, k) , \qquad t_{i-1} \le t < t_i , \tag{4.41} \]

with the interpolation coefficients $\alpha(t)$ and $\beta(t)$ not necessarily linear in time. The linear 
operator $\mathcal{I}$ can also be applied separately to the spectral magnitude and phase, in 
which case the interpolation is non-linear with respect to the complex values $X(t_i, k)$. 
The signal $x_I(t, \phi)$ is created through the inverse Fourier transform of $X_I(t, k)$, 

\[ x_I(t, \phi) = \sum_k X_I(t, k)\, e^{jk\phi} . \tag{4.42} \]

The signal $y_I(t)$ can be obtained from $x_I(t, \phi)$ as 

\[ y_I(t) = x_I\!\left(t,\; \phi(t_{i-1}) + \int_{t_{i-1}}^{t} \frac{2\pi}{p_I(t')}\, dt' \right) , \tag{4.43} \]

where $\phi(t_{i-1})$ is the initial phase at time $t_{i-1}$ such that $y_I(t_{i-1}) = x_I(t_{i-1}, \phi(t_{i-1}))$. 

Spectral Interpolation in TFI 

In TFI, the discrete LP residual is not regarded as a continuous signal. The signal 
$u(n, m)$ is formed based on the discrete instantaneous pitch period $p(n)$, 

\[ u(n, m) = s(n + m) , \qquad m_1(n) \le m < m_2(n) , \tag{4.44} \]

where $m_1(n)$ and $m_2(n)$ are such that 

\[ p(n) = m_2(n) - m_1(n) . \tag{4.45} \]

Only a decimated version of the signal $u(n, m)$ is created so that the ensemble 
$\{u(n_i, m)\}$ is formed. 

The TFI coder does not normalize the waveforms $\{u(n_i, m)\}$ with respect to 
$m$. One cannot therefore proceed the same way as in the spectral interpolation used in 
WI: waveform alignment, Fourier transform, interpolation, inverse Fourier transform, 
mapping from the two-dimensional signal to the one-dimensional signal. In TFI it is 
assumed that these operations commute (Shoham 1992, 1993b). The waveform alignment 
is eliminated and the inverse DFT is based on an approximation to time-scale 
modification. The modified inverse DFT allows a gradual change of the instantaneous 
period of the time waveform. This is achieved by making the phase of the basis 
functions of the transform independent of the DFT size, and changing it according 
to the required instantaneous period. Linear interpolation is not performed on the 
decimated version of $X(n, k)$ (the ensemble $\{X(n_i, k)\}$), but on the two-dimensional 
time signal obtained through the modified inverse DFT, the ensemble $\{x_I(n_i, m)\}$. 
The DFT is calculated as 

\[ X(n_i, k) = \sum_{m=m_1(n_i)}^{m_2(n_i)-1} u(n_i, m)\, e^{-j \frac{2\pi}{p(n_i)} k m} , \qquad k = 0, \ldots, p(n_i) - 1 . \tag{4.46} \]

The two-dimensional signal $x_I(n_i, m)$ is obtained via the inverse DFT modified to 

\[ x_I(n_i, m) = \frac{1}{p(n_i)} \sum_{k=0}^{p(n_i)-1} X(n_i, k)\, e^{j k \Phi(n_i, m)} . \tag{4.47} \]
The phase of the basis functions of the inverse transform $\Phi(n_i, m)$ is computed as 

\[ \Phi(n_i, m) = \Phi(n_i, n_i) + 2\pi f_I(n_i, m)\, (m - n_i) \tag{4.48} \]

with the initial phase 

\[ \Phi(n_i, n_i) = \frac{2\pi}{p(n_i)}\, n_i . \tag{4.49} \]

The interpolated frequency $f_I(n_i, m)$ is given by 

\[ f_I(n_i, m) = \begin{cases} \dfrac{1 - \alpha_p(m)}{p(n_{i-1})} + \dfrac{\alpha_p(m)}{p(n_i)} & \text{for } n_{i-1} \le m < n_i , \\[2ex] \dfrac{1 - \alpha_p(m)}{p(n_i)} + \dfrac{\alpha_p(m)}{p(n_{i+1})} & \text{for } n_i \le m < n_{i+1} . \end{cases} \tag{4.50} \]

The interpolation coefficient $\alpha_p(n)$ is specified as 

\[ \alpha_p(n) = \frac{n \bmod N}{N} . \tag{4.51} \]

A linear interpolation is applied to the two-dimensional time signals 

\[ y_I(n) = (1 - \alpha(n))\, x_I(n_{i-1}, n) + \alpha(n)\, x_I(n_i, n) , \qquad n_{i-1} \le n < n_i , \tag{4.52} \]

with 

\[ \alpha(n) = \alpha_p(n) . \tag{4.53} \]
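
The time-domain cross-fade of (4.51) to (4.53) can be illustrated with a few lines of Python. This is a sketch under stated assumptions rather than the TFI implementation: the function name `tfi_crossfade` is hypothetical, the two input lists stand for the already reconstructed waveforms $x_I(n_{i-1}, n)$ and $x_I(n_i, n)$ over one update interval, and the interval is assumed to start at a multiple of its length so that the coefficient runs from 0 upwards.

```python
def tfi_crossfade(x_prev, x_cur, n_start, interval):
    """Blend two reconstructed waveforms over one update interval.

    x_prev[n - n_start] and x_cur[n - n_start] stand for x_I(n_{i-1}, n)
    and x_I(n_i, n) on n_start <= n < n_start + interval.
    """
    if len(x_prev) < interval or len(x_cur) < interval:
        raise ValueError("both waveforms must cover the whole interval")
    out = []
    for n in range(n_start, n_start + interval):
        alpha = (n % interval) / interval  # interpolation coefficient, (4.51)
        idx = n - n_start
        out.append((1.0 - alpha) * x_prev[idx] + alpha * x_cur[idx])  # (4.52)
    return out


blend = tfi_crossfade([1.0] * 4, [0.0] * 4, n_start=8, interval=4)
assert blend == [1.0, 0.75, 0.5, 0.25]
```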

In both WI and TFI, the spectral interpolation is performed with fixed rate, i.e., 
the time interval $t_i - t_{i-1}$ in the WI coder and the interval $n_i - n_{i-1}$ in the TFI coder are 
constant. 


4.2.2 Spectral Interpolation in the PPE model 

In the PPE coder, spectral interpolation is used to interpolate waveshapes of pitch 
pulses. The pitch pulses are extracted from the LP residual in such a way that 
the pulses are aligned. Every pitch pulse is regarded as a separate entity and the 
pulses are padded with zeros to a common length. The underlying pitch pulses are 
estimated in the time domain and the estimated pulses are transformed into the frequency 
domain with the Fast Fourier Transform (FFT) algorithm. Use of the computationally 
efficient FFT is feasible because all pitch pulses are of the same length. We avoid the 
computationally expensive direct DFT calculations necessary in the WI and the TFI 
coders. 

Once per frame, the pitch pulse-shape update information is transmitted. The 
waveshapes of the intermediate pulses are formed by linear interpolation. Since the 
number of pitch pulses varies from frame to frame, the number of pitch pulse shapes 
to be interpolated is not constant. Our interpolation is linear “in the number of 
interpolated pulses” and not linear “in time” . It means that the coefficients of the 
linear interpolation depend on the number of pulses to be interpolated and not on 
the relative positions (lengths) of the pulses. 
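
The idea of interpolating “in the number of interpolated pulses” can be sketched as follows. The weights depend only on the pulse index, not on where the pulses fall in time. The function name `interpolate_pulse_shapes` is hypothetical, and the specific weight $(j+1)/N$ is an assumed choice for illustration. The same arithmetic applies whether the vectors hold time samples or (complex) spectral coefficients.

```python
def interpolate_pulse_shapes(prev_shape, cur_shape, n_pulses):
    """Return n_pulses shapes; the last one equals cur_shape.

    The shorter input is zero-padded so both shapes share a common length.
    The weight (j + 1) / n_pulses is an assumed choice for illustration.
    """
    if n_pulses < 1:
        raise ValueError("need at least one pulse")
    length = max(len(prev_shape), len(cur_shape))
    prev = list(prev_shape) + [0.0] * (length - len(prev_shape))
    cur = list(cur_shape) + [0.0] * (length - len(cur_shape))
    shapes = []
    for j in range(n_pulses):
        w = (j + 1) / n_pulses  # depends on the pulse index only, not on time
        shapes.append([(1.0 - w) * p + w * c for p, c in zip(prev, cur)])
    return shapes


shapes = interpolate_pulse_shapes([0.0, 0.0], [4.0, 8.0], n_pulses=4)
assert shapes[-1] == [4.0, 8.0]
```

A frame with two pulses and a frame with six pulses therefore both end exactly on the newly coded shape, only the step between consecutive shapes differs.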

We do not modify the spectra of the pulses prior to interpolation. The smoothing 
of the evolution of the pulses is performed in the time domain in the process of 
estimation of the underlying pitch pulses as described in Section 3.3.4. The estimation 
of the underlying pulses can also be performed on the pulse spectra. This would 
correspond to the spectral modifications (filtering with respect to time t) employed 
prior to the interpolation in WI. 

4.3 Summary 

In this chapter we have analyzed the pitch interpolations used in WI, STC and 
RCELP. We have argued that the time warping of the pitch pulses resulting from 
the interpolations employed in those coders is not justified from the point of view of 
the characteristics of the LP residual. In the PPE model, the interpolation of the 
pitch pulse length and the interpolation of the pitch pulse waveshape are decoupled 
and the time warping is avoided. 

The interpolation of the pitch pulse lengths has been described in terms of (i) lin- 
ear interpolation of the pulse lengths, (ii) linear and quadratic interpolation of the 
fundamental frequency, (iii) linear interpolation of the instantaneous pitch period. 
Whatever the domain, the interpolated pitch pulse lengths are calculated and the LP 
excitation is formed based on the relation between the lengths of the original and the 
reconstructed pulses. 

Finally, we have described spectral interpolation used in WI and TFI. The spectral 
interpolation is used in the PPE coder to interpolate waveshapes of the pitch pulses. 
The interpolation of the pulse waveshapes does not affect the interpolated pitch pulse 
positions. The waveshape interpolation is performed in the frequency domain on the 
spectra of the underlying pitch pulses. 

The presentation in this chapter was based on the Fourier transform and the 
interpolation was performed in the frequency domain. Any other transformation 
determined to be more appropriate for a particular type of interpolation or spectral 
modification can be used; the Discrete Cosine Transform (DCT) is one of the viable 
alternatives. 


Chapter 5 

Implementation of the 4 kb/s PPE 
Coder 


The PPE model described in the previous chapters has been implemented as a 4 kb/s 
coder. The PPE coder is an analysis-by-synthesis based LP coder with emphasis 
on modelling the voiced LP excitation. The pitch pulse analysis is performed pitch 
synchronously and the unvoiced contribution to the LP excitation is coded with a 
fixed-block-length analysis. Individual pitch pulses are identified in the LP residual 
and the pulses are extracted. One pitch pulse position is encoded per frame and 
the intermediate pulse positions are determined via pitch pulse-length interpolation. 
A modified LP residual is created by shifting the original pitch pulses to the new 
positions and the modified speech signal is formed. 

The underlying pitch pulses are estimated based on the pulses extracted from the 
LP residual. The underlying pulses are decimated and one pulse per frame is coded 
in the frequency domain. The intermediate underlying pitch pulses are obtained via 
pitch pulse-shape interpolation. The noise contribution is coded with the generalized 
analysis-by-synthesis procedure (with respect to the modified speech signal). The 
gain is encoded as the total gain and the gain ratio between the underlying pitch 
pulses and the superimposed noise. 

A detailed description of the coder follows. 
5.1 The Coder Structure 

The block diagram of the encoder is presented in Fig. 5.1. The encoder includes the 
following stages: 

o Linear prediction analysis : The LP coefficients are calculated and coded (Sec- 
tion 5.2). The LP residual is obtained by filtering the original speech using the 
unquantized LP coefficients. 

o Pitch pulse position analysis: The pitch pulses are extracted from the LP residual 
(Section 5.3) and the pitch pulse positions are coded (Section 5.4). 

o Pitch pulse waveshape analysis : The extracted pitch pulses are amplitude nor- 
malized. The pulse gain is quantized and pitch synchronously interpolated 
(Section 5.5). 

The underlying pitch pulses are estimated and transformed to the frequency 
domain. The current underlying pitch pulse is predicted and based on the 
prediction, the waveshape of the current underlying pitch pulse is coded. The 
coded pulses are interpolated to render the intermediate underlying pulses (Sec- 
tion 5.6). 

o Noise analysis: The relative ratio between the noise component and the under- 
lying pitch pulse component is calculated and coded. The difference between 
the extracted pitch pulses and the coded interpolated pulses is placed as the 
unvoiced residual at the coded pitch pulse positions. The thus formed noise sig- 
nal is coded based on analysis-by-synthesis using a perceptual weighting filter 
based on the unquantized LP filter A(z) (Section 5.7). 

The following parameters are coded: 

A(z) - The LP filter with coefficients coded in the LSF domain. 

P - Pitch positions and number of pitch pulses between frames. 

g - Total gain of the LP residual (calculated pitch synchronously). 

D(n) - Pitch pulse shape coded in the frequency domain. One extra bit is used 
to switch between differential and non-differential coding. 

N(n) - Noise shape in the time domain. 

fn/p - Relative gain of the superimposed noise. 
The bit allocation of the 4 kb/s coder for all the coded parameters is presented in 
Table 5.1. In every frame, the LSF parameters, the pitch pulse positions and the total 
gain are coded. In frames with pitch pulses, the remaining bits are used for coding 
the pitch pulse shape, pitch pulse noise, and pulse to noise gain ratio. In frames with 
no pitch pulses, the remaining bits are used to code the noise shape. 


Table 5.1 Bit allocation in the 4 kb/s PPE coder. 

Parameter                          Bits/Samples   Bits/Frame   Bits/Second   Update Rate 

Line Spectral Frequencies          30 / 160       30           1500          50 Hz 
Pitch Position                     8 / 160        8            400           50 Hz 
Total Gain                         5 / 80         10           500           100 Hz 

When Pitch Pulses Are Identified: 
Pitch Pulse Shape                  9 / 160        9            450           50 Hz 
Absolute Shape/Pulse Difference    1 / 160        1            50            50 Hz 
Pitch Pulse Noise                  4 / 40         16           800           200 Hz 
Pulse to Noise Ratio               3 / 80         6            300           100 Hz 

When No Pitch Pulse Is Identified: 
Noise Shape                        8 / 40         32           1600          200 Hz 

The Total Number Of Bits Used                     80           4000 
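
The totals in Table 5.1 can be checked with a short calculation. The sketch below only re-adds the per-frame figures from the table; the dictionary keys are descriptive labels chosen here, and the frame rate of 50 frames per second corresponds to 160-sample frames at the 8 kHz sampling rate.

```python
FRAMES_PER_SECOND = 50  # 160-sample frames at 8 kHz

# Bits per frame, taken from Table 5.1 (key names are labels chosen here).
COMMON = {"lsf": 30, "pitch_position": 8, "total_gain": 10}
VOICED = {"pulse_shape": 9, "abs_diff_switch": 1,
          "pulse_noise": 16, "pulse_to_noise_ratio": 6}
UNVOICED = {"noise_shape": 32}


def bits_per_frame(extra):
    """Bits coded in every frame plus the mode-dependent remainder."""
    return sum(COMMON.values()) + sum(extra.values())


voiced_bits = bits_per_frame(VOICED)
unvoiced_bits = bits_per_frame(UNVOICED)
# Both frame types use the same budget, so the bit rate is constant.
assert voiced_bits == unvoiced_bits == 80
assert voiced_bits * FRAMES_PER_SECOND == 4000
```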



The block diagram of the PPE decoder is presented in Fig. 5.2. The decoder 
includes the following stages: 

o Generating the pitch pulses: The shape of the underlying pitch pulse is created 
based on the predicted pitch pulse shape and the coded pitch pulse difference. 
The intermediate pitch pulses are formed by interpolation. 

o Pitch pulse positioning : The pitch pulses are placed at the coded pulse posi- 
tions. 

o Adding the noise component : The coded noise is added to form the LP excita- 
tion. The gain is pitch synchronously interpolated and applied to the excitation. 

o Adding the formant structure: The excitation is filtered with the LP synthesis 
filter to produce the coded speech. 
Fig. 5.1 Block diagram of the PPE encoder. 

Fig. 5.2 Block diagram of the PPE decoder. 

In speech coding, many extra components such as pre- and post-processing 
are used to improve the quality of the coded speech. At present, we do not use any 
additional techniques in our coder. We have concentrated on the proper implementation 
of the features which are unique to our coder and on demonstrating that we 
can achieve very high quality speech without the help of these extra components. 

5.2 Linear Prediction Analysis and Coding 

Tenth-order linear prediction (LP) analysis is performed every 20 ms with the 
autocorrelation method. We use a Hamming window† of length 240 ms centered 
on frame boundaries. The LP coefficients are calculated with the Levinson-Durbin 
recursion (Markel and Gray 1976, Rabiner and Schafer 1978). We use bandwidth 
expansion‡ in which the LP coefficient $a_n$ is multiplied by $\gamma^n$ with $\gamma = 0.977$. The LP 
coefficients are converted to the LSF domain with the method described in (Kabal 
and Ramachandran 1986). 
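
The two numerical steps named above, the Levinson-Durbin recursion and bandwidth expansion, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the coder's code: the function names are hypothetical, the predictor convention is assumed to be $\hat{x}(n) = \sum_i a_i x(n-i)$, and windowing and the computation of the autocorrelation values are omitted.

```python
def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations for predictor coefficients.

    r holds autocorrelation values r[0..order]. Returns (a, err) where
    a[i - 1] multiplies x(n - i) in the assumed predictor convention and
    err is the final prediction error energy.
    """
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient of stage i
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err


def bandwidth_expand(a, gamma):
    """Multiply the n-th coefficient by gamma**n (n starts at 1)."""
    return [coef * gamma ** (n + 1) for n, coef in enumerate(a)]


# Autocorrelation of a first-order process: only a_1 is non-zero.
a, err = levinson_durbin([1.0, 0.5, 0.25], order=2)
assert abs(a[0] - 0.5) < 1e-12 and abs(a[1]) < 1e-12
expanded = bandwidth_expand(a, gamma=0.977)
```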

In our bit budget for the 4 kb/s coder we allocated 30 bits per frame for coding 
the LP parameters. In our initial implementation of the LSF codebooks however, 
we assumed coding the LP parameters with 24 bits per update. Subsequently all 
the testing presented in this chapter has been performed with LP parameters coded 
with only 24 bits/frame. We used two split codebooks, one for the first 4 LSFs and 
the other for the last 6 LSFs, trained with the Generalized Lloyd Algorithm (GLA) 
(Gersho and Gray 1992) on a database which did not include the test sentences. 
The error weighting used in the quantization of the LP parameters is as suggested in 
(Paliwal and Atal 1993). The training of the codebooks as well as the development 
of the software used in the training are part of our work. 

Since we use only 24 bits/frame for coding the LP parameters, the actual bit rate 
of the implemented coder is 300 b/s lower than 4 kb/s (i.e., 3.7 kb/s). These bits 
could be reallocated to improve other aspects of the coder. Time did not permit 
such an implementation within the scope of this work. The vectors of unquantized LSFs 
are linearly interpolated with an up-sampling rate of 10 and converted back to LP 
coefficients. The LP residual is calculated by inverse filtering using the interpolated 
unquantized LP coefficients. 

†We note that modern coders have adopted non-symmetric windows which help in reducing 
the algorithmic delay of the coder. Our focus in this thesis is on the pitch pulse coding; any 
improvements to the LP analysis will also help PPE. 

‡Bandwidth expansion has been shown to avoid unnaturally peaky formant structure and to help 
reduce quantization cross-overs of closely spaced LSFs. 

5.3 Pitch Pulse Extraction 

As explained in Section 3.2, the extraction of pitch pulses is viewed as a problem 
of appropriate segmentation of the LP residual. The segmentation is based on the 
minimization of the prediction error between (i) the model pulses (underlying pulses of 
the simplified estimation) and (ii) the noisy pulses of the LP residual. 

The implementation details are presented in this section. The constants used in 
our algorithm are specified in Table 5.2. Except for $N_P$ and $F_{ups}$, all the integer values 
in the table are given in samples for the 8 kHz sampling rate. The values marked 
with an asterisk are subject to the up-sampling rate $F_{ups}$, which in our coder is equal to 
eight. 


5.3.1 Frame Classification 

The residual is divided into frames of length $L_F$. The boundaries of frame $k$ are 
written as $t_k$ and $t_{k+1}$, 

\[ t_{k+1} = t_k + L_F . \tag{5.1} \]

Four types of frames are identified: 

noise frame - no pitch pulses, 
start frame - the pitch pulses start, 
continue frame - the series of pitch pulses continues, 
end frame - the series of pitch pulses ends. 

The following transitions between frames are allowed: 

noise frame -> noise or start frame, 
start frame -> continue or end frame, 
continue frame -> continue or end frame, 
end frame -> noise or start frame. 

For the purpose of comparing alternative segmentations, four errors are determined 
in every frame: 

$e^{(n)}$ - noise frame error, 
$e^{(s)}$ - start frame error, 
$e^{(c)}$ - continue frame error, 
$e^{(e)}$ - end frame error. 

The errors are calculated with a one frame look-ahead. 

Table 5.2 The constants used in the pitch extraction algorithm. The 
values marked with an asterisk are subject to up-sampling rate $F_{ups}$, 
which in the described coder is equal to eight. 

Symbol     Value   Description 

L_F        160*    Frame length 
P_min      20*     Minimum pitch pulse length 
P_max      150*    Maximum pitch pulse length 
fine       0.7     Noise error scaling coefficient 
N_P        2       Maximum number of peaks in a window of length L_P 
L_P        20*     Length of the window in which at most N_P peaks are allowed 
L_Ps       10*     Length of the shadow cast by a peak 
P_offs     10*     Pitch pulse offset 
Lf         240*    Length of the segmentation window in the start mode 
Lf         320*    Length of the segmentation window in the continue mode 
$] Pf      0.3     Maximum relative pulse length decrease (0.3 means by 30%) 
^P!        0.3     Maximum relative pulse length increase (0.3 means by 30%) 
α_P        0.5     Estimation filter coefficient 
P_align    8*      Maximum pitch pulse alignment offset 
L_e0       5*      Start sample for the pitch pulse energy shaping 
L_e        10*     Length of the pitch pulse energy shaping window 
D_pred     60      Maximum dimension of the model pitch pulse selection 
D_align    60      Maximum dimension of the pulse alignment 
Lp e       40*     Maximum frame extension for frame error calculation 
5pe        0.9     Minimum normalized prediction error 
F_ups      8       Up-sampling rate 

The cumulative error of the present and the next frame is calculated and, based on 
this error, the type of the present frame is determined. If the last frame was identified 
as noise or end we compute: 

e n = ef + mm(e£k, eg*) , (5.2) 

e 3 = e^ + min(eg. lt eg*) . (5.3) 

The current frame is classified as a noise frame if e n < £ s and as a start frame 
otherwise. If the last frame was identified as start or continue we calculate: 

e c = eg + min(eg* , eg*) , (5.4) 

e e = eg + min(eg*, eg*) . (5.5) 

The current frame is classified as a continue frame if £ c < £e and as an end frame 
otherwise. 
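The decision rule can be sketched in a few lines of Python. This is only an illustration, not the coder's implementation: the function name, the dictionary keys and the assumption that each present-frame error is paired with the smaller of the two next-frame errors its successor types allow are ours.

```python
def classify_frame(prev_type, e_cur, e_next):
    """Return the type of the present frame.

    prev_type: 'noise', 'start', 'continue' or 'end' (type of the last frame).
    e_cur, e_next: the four frame errors of the present and the next frame,
    given as dicts keyed by 'n', 's', 'c', 'e' (hypothetical layout).
    """
    if prev_type in ("noise", "end"):
        # After a noise frame only noise or start may follow;
        # after a start frame only continue or end may follow.
        eps_n = e_cur["n"] + min(e_next["n"], e_next["s"])
        eps_s = e_cur["s"] + min(e_next["c"], e_next["e"])
        return "noise" if eps_n < eps_s else "start"
    # The last frame was start or continue.
    eps_c = e_cur["c"] + min(e_next["c"], e_next["e"])
    eps_e = e_cur["e"] + min(e_next["n"], e_next["s"])
    return "continue" if eps_c < eps_e else "end"
```

The one-frame look-ahead appears only through `e_next`; no decision about the next frame itself is taken at this point.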

The errors $e^{(n)}$, $e^{(s)}$, $e^{(c)}$ and $e^{(e)}$ are computed based on the segmentation of
the residual and the identified model pitch pulses. For the calculation of $e^{(n)}$
there is no segmentation; the errors $e^{(s)}$, $e^{(c)}$ and $e^{(e)}$ correspond to the
segmentations obtained in the start, continue and end mode, respectively.

In determining the type of the current frame, the errors based on the segmentation
of the next frame are used. The pitch pulse extraction look-ahead beyond the
current frame would normally be equal to $L_S^{(c)}$ (the length of the longest
segmentation window). To decrease the look-ahead, first the segmentations for the
next frame are performed with the lengths of the segmentation windows limited to
the frame length $L_F$. This may result in a different segmentation than for longer
windows, but since those segmentations are used only in error calculations they are
not crucial. When the present "next frame" becomes the current frame, the
segmentations are redefined based on the longer segmentation windows.

5.3.2 Error Calculation

The Error Between Pitch Pulses

The error between a model pitch pulse and a pulse identified in the LP residual is
specified by the prediction error between the pulses. The prediction error between
vectors $x$ and $y$ is defined as

$$ \varepsilon_p(x, y) = \min_{\beta} (y - \beta x)^2 . \tag{5.6} $$

We want to predict the LP-residual noisy pulses from the model pitch pulses. In this
formulation we predict vector $y$ from vector $x$ and hence $y$ corresponds to the noisy
pulses and $x$ corresponds to the model pulses.

The optimal prediction gain $\beta$ which minimizes the error is calculated as

$$ \beta_{opt} = \frac{x^T y}{x^T x} . \tag{5.7} $$

With this choice of $\beta$, the prediction error is given by

$$ \varepsilon_p(x, y) = y^T y - \beta_{opt}\, x^T y . \tag{5.8} $$

Writing

$$ C(x, y) = x^T y \quad \text{and} \quad E(x) = x^T x , \tag{5.9} $$

we obtain

$$ \beta_{opt} = \frac{C(x, y)}{E(x)} \quad \text{and} \quad \varepsilon_p(x, y) = E(y) - \beta_{opt}\, C(x, y) . \tag{5.10} $$

If the vectors $x$, $y$ are normalized in amplitude, i.e., $E(x) = 1$ and $E(y) = 1$, the
above equations simplify to

$$ \beta_{opt} = C(x, y) \quad \text{and} \quad \varepsilon_p(x, y) = 1 - \beta_{opt}^2 . \tag{5.11} $$

This formulation assumes that vectors $x$ and $y$ are of equal length. If $x$ is shorter
than $y$, we extend the vector $x$ with zeros so that both vectors have the same length.
If $x$ is longer than $y$, we truncate the vector $x$ to the length of the vector $y$. Note
that it is always vector $x$ which is extended or truncated. The vector $y$ is predicted
from $x$ and so it keeps its initial length.

Since the prediction error is applied to the pitch pulses and we assume that the
pulses are well correlated, we limit $\beta$ to positive values. If $\beta_{opt} < 0$ we set
$\beta_{opt} = 0$.
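A minimal Python sketch of (5.6)-(5.10), including the length matching of $x$ and the clamping of the gain, may make the definition concrete. The function name and the handling of an all-zero model vector are our assumptions.

```python
def prediction_error(x, y):
    """Return (error, gain) when predicting the noisy pulse y from the model pulse x.

    x is zero-padded or truncated to the length of y; y keeps its length.
    """
    n = len(y)
    x = (list(x) + [0.0] * n)[:n]           # extend with zeros, then cut to len(y)
    c = sum(a * b for a, b in zip(x, y))    # C(x, y) = x^T y
    e_x = sum(a * a for a in x)             # E(x)
    e_y = sum(b * b for b in y)             # E(y)
    # Assumption: an all-zero model vector predicts nothing (gain 0).
    beta = c / e_x if e_x > 0 else 0.0
    beta = max(beta, 0.0)                   # negative gains are set to zero
    return e_y - beta * c, beta
```

With a clamped gain of zero the error reduces to $E(y)$, the energy of the pulse being predicted.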

The Noise Error 

The region in which no pitch pulses are identified is considered as "unvoiced only".
Conceptually, the prediction error of an "unvoiced only" region is equal to the energy
of the signal. We introduce a correction factor $\delta_{ne}$, which is called the noise error
scaling coefficient. We calculate the error of an unvoiced region as the energy of the
region scaled by $\delta_{ne}$.

The value of $\delta_{ne}$ affects our acceptance of a segmentation as a series of pitch pulses.
If $\delta_{ne} = 1$, any segmentation with positive correlation between the model pulses and
the segments of the residual signal will be considered as a series of pitch pulses. Also,
once the series starts it might never end. We want, however, to identify the beginning
of a series to be able to correctly specify and align the pulses. If $\delta_{ne} = 0$, none of the
segmentations will be good enough to be considered for a pitch pulse series. We
tested values of $\delta_{ne}$ in the range from 0.6 to 0.9. The influence of $\delta_{ne}$ on the frame
classification is weakened by the calculation of the cumulative error of two frames and
by frame extension, which is explained later in this section. We obtained very good
results using $\delta_{ne} = 0.7$.
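As a worked illustration (our own simplification, in which a single region $y$ is compared with one model pulse $x$ and the two-frame accumulation and frame extension are ignored), the definitions of $C$ and $E$ in (5.9) give a correlation threshold implied by the noise error scaling coefficient:

```latex
% Prediction error of the region y, with rho the normalized correlation:
\varepsilon_p(x, y) = E(y) - \frac{C(x, y)^2}{E(x)} = E(y)\,(1 - \rho^2),
\qquad \rho = \frac{C(x, y)}{\sqrt{E(x)\,E(y)}} .

% The pulse interpretation beats the noise interpretation when
E(y)\,(1 - \rho^2) < \delta_{ne}\, E(y)
\quad\Longleftrightarrow\quad
\rho > \sqrt{1 - \delta_{ne}} .

% For delta_ne = 0.7 this gives rho > sqrt(0.3), approximately 0.548.
```

The two limiting cases agree with the text: for $\delta_{ne} = 1$ any positive correlation suffices, and for $\delta_{ne} = 0$ the condition $\rho > 1$ can never be met.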

5.3.3 Segmentation of the LP Residual 

The pitch pulses are defined by a segmentation of the LP residual. The series of pitch
pulses form the voiced regions of the residual†. The segmentation is carried out in
three modes: start, continue and end. For frame $k$, the three modes provide the pulse
markers $\tau_{k,i}^{(s)}$, $\tau_{k,i}^{(c)}$ and $\tau_{k,i}^{(e)}$. The pulse markers identified for the frame in a
particular mode are indexed by $i$, $0 \le i \le N$; the $N+1$ markers delimit $N$
complete pulses.

The pitch markers $\tau_{k,i}^{(s)}$, $\tau_{k,i}^{(c)}$ and $\tau_{k,i}^{(e)}$ designate the positions of consecutive
pitch pulses. To make our presentation more readable we will drop, for the time being,
the superscripts $(s)$, $(c)$ and $(e)$. We will use them only when we need to differentiate
between the parameters and the pitch pulse markers specific to, or obtained in, a
particular segmentation mode.

To simplify the notation we will also drop the subscript $k$ on the pulse position
markers $\tau$ and the corresponding pitch pulse vectors. It should be clear from the
context that they belong to the segmentation performed for the frame $k$. What
should be written as $\tau_{k,i}$ is now simply $\tau_i$.

† As noted earlier, in the PPE model the voiced regions contain a noise component which is here
the difference between the estimated model pulses and the LP residual.

In our notation a continuous block of samples of the residual $r$ beginning at $t$
and ending at $t'$, which includes the sample $r_t$ but does not include the sample $r_{t'}$, is
written as $r[t : t')$. The pitch pulse between marks $\tau_i$ and $\tau_{i+1}$ is written as $p_i$,

$$ p_i = r[\tau_i : \tau_{i+1}) . \tag{5.12} $$

The length of the pulse, $L_{p_i}$, is given by

$$ L_{p_i} = \tau_{i+1} - \tau_i . \tag{5.13} $$

The model pitch pulses are written as $q_i$. The dimension of the vector $q$ is always
extended to $P_{max}$. When the vector $q$ is formed from shorter vectors they are padded
with zeros to length $P_{max}$.

Candidate Pitch Pulse Positions 

We determine all possible positions of the pulses based on the energy maxima of
the LP residual signal. No more than $N_P$ energy maxima for every $L_P$ samples are
accepted. Each identified maximum casts a forward "shadow" of $L_{Ps}$ samples within
which no smaller energy peak is recognized. A larger maximum resets the "shadow"
and casts one of its own. An energy maximum is accepted as a candidate pitch
pulse position if it is one of the $N_P$ largest energy peaks within $\pm L_P$ samples; the
candidate pulse position is set at $P_{offs}$ samples prior to the energy maximum. In the
segmentation of the LP residual, the pitch pulse positions $\tau_i$ are chosen from the set
of the candidate pitch pulse positions.
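One possible reading of this rule is sketched below. The text leaves several details open, so the following are our assumptions: sample energy is the squared sample value, a peak is a local maximum of that energy, a peak inside a shadow is dropped unless it is larger than the peak casting the shadow, and candidate positions are clipped at zero. Default values follow Table 5.2 without up-sampling.

```python
def candidate_positions(residual, n_p=2, l_p=20, l_ps=10, p_offs=10):
    """Return candidate pitch pulse positions (sample indices) for a residual."""
    energy = [s * s for s in residual]
    n = len(energy)
    # 1. Local energy maxima.
    peaks = [i for i in range(n)
             if (energy[i] > energy[i - 1] if i > 0 else energy[i] > 0)
             and (i == n - 1 or energy[i] >= energy[i + 1])]
    # 2. Forward shadow: a peak within l_ps samples of the last recognized
    #    peak survives only if it is larger; it then casts its own shadow.
    kept = []
    for i in peaks:
        if kept and i - kept[-1] <= l_ps and energy[i] <= energy[kept[-1]]:
            continue
        kept.append(i)
    # 3. Keep a peak only if fewer than n_p larger peaks lie within +/- l_p.
    candidates = []
    for i in kept:
        larger = sum(1 for j in kept
                     if j != i and abs(j - i) <= l_p and energy[j] > energy[i])
        if larger < n_p:
            candidates.append(max(i - p_offs, 0))   # offset before the maximum
    return candidates
```

The segmentation then draws the markers $\tau_i$ only from the returned list, which is what keeps the later search over segmentations tractable.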
The Model Pulses

Every pulse $p_i$ has a corresponding model pulse $q_i$. Ideally, the model pulse $q_i$ should
be some estimate of the underlying pitch pulse. At this stage of the process, we
use simplified model pulses to obtain a segmentation. The pulses defined by the
segmentation are used in the underlying pulse estimation procedure which is invoked
later. The model pulse $q_i$ is one of the following:

• The previous pitch pulse vector $p_{i-1}$.

• The next pitch pulse vector $p_{i+1}$.

• The average of the previous and the next pulses, $p'_i$, given by

$$ p'_i = \frac{p_{i-1} + p_{i+1}}{2} . \tag{5.14} $$

• The average of the past pitch pulse vectors, $q'_i$, given by

$$ q'_i = (1 - \alpha_p)\, q_{i-2} + \alpha_p\, p_{i-1} . \tag{5.15} $$

  The coefficient $\alpha_p$ is the weight of the last noisy pitch pulse $p_{i-1}$. If the model
  pitch pulse $q_{i-2}$ is equal to $p_{i-1}$, the above would render $q'_i$ equal to $p_{i-1}$. In
  this case we set $q'_i$ as

$$ q'_i = (1 - \alpha_p)\, p_{i-2} + \alpha_p\, p_{i-1} . \tag{5.16} $$

The vector chosen for the model pulse $q_i$ is the one which minimizes the prediction
error with respect to the vector $p_i$. We have

$$ q_i = \operatorname*{argmin}_{x \in \{p_{i-1},\, p_{i+1},\, p'_i,\, q'_i\}} \varepsilon_p(x, p_i) \tag{5.17} $$

with $q'_i$ and $p'_i$ as specified above.
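The selection among the four candidate model pulses can be sketched as follows. The function names are hypothetical, and for brevity the candidates are padded to the longest of the input vectors rather than to $P_{max}$; the prediction error is the clamped-gain error of Section 5.3.2.

```python
def prediction_error(x, y):
    """Error of predicting y from x (x padded or truncated, gain limited to >= 0)."""
    n = len(y)
    x = (list(x) + [0.0] * n)[:n]
    c = sum(a * b for a, b in zip(x, y))
    e_x = sum(a * a for a in x)
    beta = max(c / e_x, 0.0) if e_x > 0 else 0.0
    return sum(b * b for b in y) - beta * c


def select_model_pulse(p_cur, p_prev, p_next, p_prev2, q_prev2, alpha_p=0.5):
    """Return (name, vector) of the candidate that best predicts p_cur."""
    n = max(len(v) for v in (p_prev, p_next, p_prev2, q_prev2))
    pad = lambda v: (list(v) + [0.0] * n)[:n]
    prev, nxt, prev2, q2 = pad(p_prev), pad(p_next), pad(p_prev2), pad(q_prev2)
    average = [(a + b) / 2 for a, b in zip(prev, nxt)]
    # If q_{i-2} coincides with p_{i-1}, fall back to p_{i-2}.
    base = prev2 if q2 == prev else q2
    recursive = [(1 - alpha_p) * a + alpha_p * b for a, b in zip(base, prev)]
    candidates = {"previous": prev, "next": nxt,
                  "average": average, "recursive": recursive}
    name = min(candidates, key=lambda k: prediction_error(candidates[k], p_cur))
    return name, candidates[name]
```

Because the gain is optimized inside the error, a candidate that differs from `p_cur` only by a positive scale factor yields zero error.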

At the beginning of a frame, the past pulses $q_{-2}$, $p_{-2}$ and $p_{-1}$ are taken from the
previous frames. In the start mode the past pulses come from the last end frame; this
"memory" of pitch pulses was found useful in identifying the beginning of a series.
Note that between a start frame and an end frame there might be intervening noise
frames. In the continue and end mode the pulses $q_{-2}$, $p_{-2}$, $p_{-1}$ come from the last
start or continue frame.

Energy Shaping of The Model Pulses 

To prevent misidentifying two pitch pulses as a single pulse, the model pulse q is 
energy shaped. The shaping is such that after the initial increase in the pulse energy, 
the energy may only decrease. If there are two pulses in the vector q, the energy of 
the second pulse will be attenuated to the energy of the signal between the two pulses. 
The energy attenuation increases the prediction error between the model pulse q and 
the corresponding pulse p, so that the total prediction error of a double-pitch-pulse 
segmentation is large and the segmentation is rejected. 

The energy shaping starts at sample $L_{E0}$ from the beginning of a pulse. First, an
energy maximum $E_{max}$ at sample $n_{max}$ is found such that the next $L_E$ samples have
smaller energy. Then, starting at $n_{max} + L_E$, the pulse vector is searched for a sample
with higher energy than $E_{max}$. If at sample $n'_{max}$ the energy $E'_{max}$ is larger than $E_{max}$,
the rest of the vector is scaled by $E_{max}/E'_{max}$. The search for a sample with energy
higher than $E_{max}$ continues starting at $n'_{max}$ and the vector is modified again if such a
sample with higher energy is found. When the end of the vector is reached, the whole
procedure is repeated starting at sample $n_{max}+1$ (i.e., find an energy maximum $E_{max}$
such that the next $L_E$ samples have smaller energy...).

Pulse Alignment 

We want to align the vector $p_i$ with respect to the model pulse vector $q_i$. To do
this, we fix the length of the pulse $p_i$, $L_{p_i}$, and we shift $\tau_i$ to reduce the prediction
error between $q_i$ and the shifted pulse $p_i$. The pulse $p_i$ of length $L_{p_i}$ starting at $\tau_i$ is
equivalent to the segment of the LP residual given as $r[\tau_i : \tau_i + L_{p_i})$. The beginning
of the aligned pulse is given by

$$ \tau_i = \operatorname*{argmin}_{\tau_i - P_{align} \,\le\, t \,\le\, \tau_i + P_{align}} \varepsilon_p\big( q_i ,\, r[t : t + L_{p_i}) \big) . \tag{5.18} $$

The constant $P_{align}$ specifies the largest shift allowed when aligning the pulses.
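The alignment search of (5.18) amounts to an exhaustive scan over the allowed shifts. The sketch below is illustrative only: the function names are ours, the search range is clipped to the available samples, and ties are resolved in favour of the earliest position, which the text does not specify.

```python
def prediction_error(x, y):
    """Error of predicting y from x (x padded or truncated, gain limited to >= 0)."""
    n = len(y)
    x = (list(x) + [0.0] * n)[:n]
    c = sum(a * b for a, b in zip(x, y))
    e_x = sum(a * a for a in x)
    beta = max(c / e_x, 0.0) if e_x > 0 else 0.0
    return sum(b * b for b in y) - beta * c


def align_pulse(residual, tau, length, q, p_align=8):
    """Return the start position within tau +/- p_align that minimizes the error."""
    first = max(tau - p_align, 0)
    last = min(tau + p_align, len(residual) - length)
    # min() returns the first minimizer, i.e. the earliest position on ties.
    return min(range(first, last + 1),
               key=lambda t: prediction_error(q, residual[t:t + length]))
```

Note that the pulse length stays fixed during the scan; only the start marker moves.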
The Segmentation Algorithm

So far we have defined the model pitch pulse $q_i$ as one of the following: $p_{i-1}$, $p_{i+1}$,
$p'_i$ or $q'_i$. The vector $p'_i$ is the average of the vectors $p_{i-1}$ and $p_{i+1}$; the vector $q'_i$
is specified by (5.15)-(5.16). In fact we do not know the next pulse, $p_{i+1}$, when we
want to establish the vector $q_i$ and align the vector $p_i$. To overcome the problem the
segmentation is done with the following procedure:

1. Set $i = 0$ and establish the vectors $p_{-1}$ and $q_{-1}$.

2. Determine the pitch pulse candidate $p_i$.

3. Calculate the average of the past pulses to form the vector $q'_i$ as specified
in (5.15)-(5.16).

4. Find an initial estimate of the model pitch pulse based only on the past pulses.
The initial model pulse $q_i^{(0)}$ is one of the vectors $p_{i-1}$, $q'_i$. Choose the vector
which minimizes the prediction error with respect to $p_i$.

5. Align the vector $p_i$ with respect to $q_i^{(0)}$.

6. If $i > 0$, reestimate the model vector corresponding to the last pulse, $q_{i-1}$. The
model pulse $q_{i-1}$ is one of the vectors: $q_{i-1}^{(0)}$, $p_i$ and $p'_{i-1}$ equal to $(p_{i-2} + p_i)/2$.
Choose the vector which minimizes the prediction error with respect to $p_{i-1}$.

7. If the end conditions of the current segmentation mode are satisfied, end the
segmentation. Otherwise increment $i$ and continue starting from (2).

Note that the alignment is always carried out with respect to the past identified
and aligned pulses.

This segmentation algorithm is used in all three modes: start, continue and end.
In every mode all viable segmentations are tested. The modes differ in (i) the
conditions the first and the last pulse must satisfy and (ii) the length of the
segmentation windows. The segmentation in a particular mode is chosen if it
minimizes the segmentation error $\varepsilon_{seg}$. The error $\varepsilon_{seg}$ is given by

$$ \varepsilon_{seg} = \varepsilon_b + \sum_{i=0}^{N-1} \varepsilon_p(q_i, p_i) + \varepsilon_e \tag{5.19} $$

where $\varepsilon_b$ and $\varepsilon_e$ are the errors at the beginning and at the end of the segmentation
window respectively. The calculations of the errors $\varepsilon_b$ and $\varepsilon_e$ differ in the three
segmentation modes.

Segmentation in the Start Mode 

In the start mode we segment the residual of length $L_S^{(s)}$ from $t_S$ to $t_{S'}$ with

$$ t_S = t_k \quad \text{and} \quad t_{S'} = t_S + L_S^{(s)} . \tag{5.20} $$

The segmentation window $L_S^{(s)}$ should be at least the size of the frame, $L_F$.

The beginning of the first pulse must be within the frame boundaries. Every
candidate mark $\tau_0 \in \{\tau\}$ such that

$$ t_k \le \tau_0 < t_{k+1} \tag{5.21} $$

is considered as a possible start of a series of pitch pulses. The beginning of the
second pulse $\tau_1$ (which is also the end of the first pulse) must lie within the distance
of $P_{min}$ to $P_{max}$ from $\tau_0$,

$$ \tau_0 + P_{min} \le \tau_1 \le \tau_0 + P_{max} . \tag{5.22} $$

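The beginning of every following pulse $\tau_i$, $i > 1$, must be within the limits set by
$P_{min}$, $P_{max}$ and the maximum allowed pitch pulse length change. We have

$$ \tau_{i-1} + \max\big(P_{min},\, (1 - \delta_{P\downarrow}) L_{p_{i-2}}\big) \;\le\; \tau_i \;\le\; \tau_{i-1} + \min\big((1 + \delta_{P\uparrow}) L_{p_{i-2}},\, P_{max}\big) \tag{5.23} $$

where $\delta_{P\downarrow}$ and $\delta_{P\uparrow}$ specify the maximum allowed relative decrease and increase of
the pitch pulse length.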
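The admissible range for the next marker can be computed with integer arithmetic, as in the following sketch. The function names are hypothetical, the relative limits are passed as integer percentages to avoid rounding surprises, and the bounds are rounded inward to whole samples, which is our assumption. Defaults follow Table 5.2 without up-sampling.

```python
def marker_bounds(prev_marker, prev_len=None, p_min=20, p_max=150,
                  dec_pct=30, inc_pct=30):
    """Return (lowest, highest) admissible position of the next pulse marker.

    prev_len is the length of the preceding complete pulse, or None when the
    marker being placed is the second one of a series.
    """
    if prev_len is None:
        lo, hi = p_min, p_max                                   # first pulse of a series
    else:
        lo = max(p_min, -(-prev_len * (100 - dec_pct) // 100))  # ceiling division
        hi = min(p_max, prev_len * (100 + inc_pct) // 100)      # floor division
    return prev_marker + lo, prev_marker + hi


def admissible(candidates, prev_marker, prev_len=None, **limits):
    """Filter candidate positions down to those inside the admissible range."""
    lo, hi = marker_bounds(prev_marker, prev_len, **limits)
    return [t for t in candidates if lo <= t <= hi]
```

For a previous pulse of 100 samples ending at position 200, for example, the next marker may lie between 270 and 330.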

We stop the segmentation at $i = N$ when $\tau_{N+1} > t_{S'}$. We have identified $N + 1$ pitch
pulses in which one pulse extends beyond the end of the segmentation window. The
segmentation error $\varepsilon_{seg}$ is calculated as in (5.19) with

$$ \varepsilon_b = \delta_{ne}\, E\big(r[t_S : \tau_0)\big) , \tag{5.24} $$

$$ \varepsilon_e = E\big(r[\tau_N : t_{S'}) - \beta_N q_N\big) \tag{5.25} $$

where $\beta_N$ is the prediction gain between vectors $q_N$ and $p_N$. The vector $q_N$ is
truncated to length $t_{S'} - \tau_N$. The segmentation which minimizes $\varepsilon_{seg}$ is the start mode
segmentation accepted for the current frame.

In the start frame, the boundaries of pitch pulse segments are further repositioned
so that the signal energy on the pulse boundaries is minimized. The maximum allowed
shift is equal to the maximum shift permitted during pulse alignment, $P_{align}$. The
repositioning is done only in the start frame.

Segmentation in the End Mode 

In the end mode the beginning of the first pulse $\tau_0$ and the beginning of the
segmentation window $t_S$ are set to the beginning of the last pulse of the previous
frame. The first pulse is not subject to alignment. The candidates for the beginning of
the next pulse have to satisfy the relation (5.22) if $\tau_0$ is the beginning of the first pulse
in the series, and relation (5.23) otherwise. In addition the beginning of the second
pulse must be within the current frame boundaries, $t_k \le \tau_1 < t_{k+1}$.

In the end mode the segmentation window extends only until the end of the frame,

$$ t_{S'} = t_{k+1} . \tag{5.26} $$

The end of the last pulse $\tau_N$ must be within the segmentation window,

$$ \tau_N \le t_{S'} . \tag{5.27} $$

The segmentation error is given by (5.19) with

$$ \varepsilon_b = 0 \quad \text{and} \quad \varepsilon_e = \delta_{ne}\, E\big(r[\tau_N : t_{S'})\big) . \tag{5.28} $$

The segmentation which minimizes $\varepsilon_{seg}$ is chosen as the end mode segmentation of
the current frame.

Segmentation in the Continue Mode 

In the continue mode the conditions for the beginning of the pulses and the beginning
of the segmentation window $t_S$ are the same as in the end mode. The end of the
segmentation window is specified by $L_S^{(c)}$ and is given by

$$ t_{S'} = t_k + L_S^{(c)} . \tag{5.29} $$

The end of the last pulse $\tau_{N+1}$ must extend beyond the current frame,

$$ \tau_{N+1} \ge t_{k+1} , \tag{5.30} $$

and the segmentation error is computed as in (5.19) with $\varepsilon_b = 0$ and

$$ \varepsilon_e = \begin{cases} \varepsilon_p(q_N, p_N) + \delta_{ne}\, E\big(r[\tau_{N+1} : t_{S'})\big) & \text{if } \tau_{N+1} < t_{S'} \\ E\big(r[\tau_N : t_{S'}) - \beta_N q_N\big) & \text{if } \tau_{N+1} \ge t_{S'} \end{cases} \tag{5.31} $$

where $\beta_N$ is the prediction gain between vectors $q_N$ and $p_N$ and the vector $q_N$ is
truncated to length $t_{S'} - \tau_N$. Again, the segmentation which minimizes this error is
chosen as the continue segmentation of the frame.

Frame Error Calculation

Based on the best segmentations in the three modes, the frame errors are calculated.
The frame is extended so that the frames overlap. This is done to reduce the effects
of unfavorable frame positioning. Every frame is extended on each side to the closest
boundary of the pitch pulse identified in the continue mode, but by no more than $L_{Fe}$.
The extended frame $k$ is specified by:

$$ t'_k = \max\big(\tau_0^{(c)},\; t_k - L_{Fe}\big) , \tag{5.32} $$

$$ t'_{k+1} = \min\big(\tau_j^{(c)},\; t_{k+1} + L_{Fe}\big) \tag{5.33} $$

where

$$ \tau_j^{(c)} = \min_{\tau_i^{(c)} \ge t_{k+1}} \tau_i^{(c)} . \tag{5.34} $$

The error $e^{(n)}$ is set to the energy of the extended frame scaled by $\delta_{ne}$. The
errors $e^{(s)}$, $e^{(c)}$ and $e^{(e)}$ are calculated based on the corresponding segmentations (start,
continue, end) applied to the extended frame.
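The frame extension can be illustrated with a short sketch. It assumes, as our reading of the text, that the left boundary is pulled back to the first continue-mode marker and the right boundary is pushed forward to the first continue-mode marker at or beyond the frame end, each by at most the maximum frame extension (40 samples in Table 5.2, before up-sampling). The function name is hypothetical.

```python
def extended_frame(t_k, t_k1, markers_c, l_fe=40):
    """Return (start, end) of the extended frame.

    t_k, t_k1: nominal frame boundaries; markers_c: ascending pulse markers of
    the continue-mode segmentation, the first one lying at or before t_k.
    """
    start = max(markers_c[0], t_k - l_fe)
    beyond = [m for m in markers_c if m >= t_k1]
    if not beyond:
        # In the continue mode the last pulse must reach past the frame end.
        raise ValueError("no continue-mode marker at or beyond the frame end")
    end = min(min(beyond), t_k1 + l_fe)
    return start, end
```

Since neighbouring frames are extended towards each other, their extended versions overlap, which is the stated purpose of the extension.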
5.3.4 Computational Savings

For long pitch pulses the tails of the pulses are not as well correlated as the initial,
high energy part. We therefore reduce the computational complexity of our extraction
method by limiting the dimension of the prediction error calculation. The dimension
of the error calculated while choosing the model pitch pulse is the minimum of $L_{p_i}$
and $D_{pred}$. The error dimension computed while aligning the pulses is the minimum
of $L_{p_i}$ and $D_{align}$. The segmentation error and the frame error, however, are always
calculated with the length $L_{p_i}$ after the length is established by aligning the next
pulse.

The maximum number of segmentations which might have to be considered is still
large. For the beginning of the first pulse in a series we consider $N_1$ candidates,

$$ N_1 \le L_F \frac{N_P}{L_P} . \tag{5.35} $$

For the beginning of the second pulse of a series we try $N_2$ candidates,

$$ N_2 \le (P_{max} - P_{min}) \frac{N_P}{L_P} . \tag{5.36} $$

For the pulse $i$, $i > 1$, there are $N_i$ candidate positions,

$$ P_{min} (\delta_{P\uparrow} + \delta_{P\downarrow}) \frac{N_P}{L_P} \;\le\; N_i \;\le\; P_{max} (\delta_{P\uparrow} + \delta_{P\downarrow}) \frac{N_P}{L_P} . \tag{5.37} $$

The total number of segmentations to be considered, $N_S$, is given by

$$ N_S = \prod_{i=0}^{I} N_i \tag{5.38} $$

with

$$ \frac{L_S}{P_{max}} \le I \le \frac{L_S}{P_{min}} \tag{5.39} $$

where $L_S$ is the length of the segmentation window.

Reducing by one the number of candidate positions for the pulse $j$ eliminates $N_e$
possible segmentations,

$$ N_e = \prod_{i=j+1}^{I} N_i . \tag{5.40} $$

The smaller the $j$, the more the computational savings.
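To give a feeling for the magnitudes, the bounds can be evaluated with the constants of Table 5.2 (taken without up-sampling, since the starred lengths scale together). The function names and the way the bounds are multiplied for a series of a given number of pulses are our own simplification.

```python
def candidate_bounds(l_f=160, p_min=20, p_max=150, n_p=2, l_p=20,
                     d_dec=0.3, d_inc=0.3):
    """Return (N_1 bound, N_2 bound, lower N_i bound, upper N_i bound)."""
    density = n_p / l_p                      # candidates allowed per sample
    n1 = l_f * density
    n2 = (p_max - p_min) * density
    ni_lo = p_min * (d_dec + d_inc) * density
    ni_hi = p_max * (d_dec + d_inc) * density
    return n1, n2, ni_lo, ni_hi


def max_segmentations(num_pulses, **constants):
    """Upper bound on the number of segmentations for a series of num_pulses."""
    n1, n2, _, ni_hi = candidate_bounds(**constants)
    return n1 * n2 * ni_hi ** max(num_pulses - 2, 0)
```

With these constants the bounds are 16 candidates for the first marker, 13 for the second and at most 9 for each later one, so a series of four pulses already allows on the order of 16 800 segmentations; this is why early rejection pays off.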

We reject a segmentation at pitch pulse $j$ if

- after the alignment of the next pulse the length of the pulse $j$ is smaller than
$P_{min}$ or larger than $P_{max}$;

- the normalized error between the model pulse $q_j$ and the pitch pulse $p_j$ is larger
than the threshold $\delta_{pe}$.

The rejection of segmentations reduces the computational complexity of the
implemented method significantly.

The residual signal is up-sampled by a factor of $F_{ups}$ and the pulses are aligned with
the up-sampled resolution, but for the purpose of calculating the errors the pulses
are decimated to the original 8 kHz sampling rate. Note that the down-sampling
does not, in general, produce the original signal because of a possible fractional shift
introduced by the alignment.

In the alignment procedure, first a rough alignment is performed with the 8 kHz 
resolution and then a fine alignment is carried out around the rough estimate. 


5.4 Coding the Pitch Pulse Positions 


The pitch information is encoded once per frame. The coded parameters are: (i) a pitch
pulse position $\tau_k$, (ii) the number of pulses between this and the last coded pulse
position, $N_k$ pulses between $\tau_{k-1}$ and $\tau_k$.

As in Section 5.3, the four types of frames are handled differently: noise frames -
frames with no pulses; start frames - frames in which pulses start; continue frames -
when pulses continue; end frames - frames in which pulses end.

The number of pulses between frames is coded indirectly. First, an expected
number of pulses $\hat{N}_k$ is calculated, and then a correction value $C_k$ is determined. If
$\hat{\tau}_{k-1}$ is the position quantized in the last frame, $P_{k-1}$ is the average pitch length of
the last frame, and $\hat{\tau}_k$ is the quantized position of the current frame, the expected
number of pitch pulses in this frame is:

$$ \hat{N}_k = \frac{\hat{\tau}_k - \hat{\tau}_{k-1}}{P_{k-1}} . \tag{5.41} $$

The correction value $C_k$ is given by

$$ C_k = \begin{cases} 0 & \text{for } 0.0 \le |N_k - \hat{N}_k| < 0.5 \\ 1 & \text{for } 0.5 \le |N_k - \hat{N}_k| < 1.0 \\ 2 & \text{for } 1.0 \le |N_k - \hat{N}_k| < 1.5 \\ 3 & \text{for } 1.5 \le |N_k - \hat{N}_k| < 2.0 \end{cases} \tag{5.42} $$

The decoder calculates the expected number of pulses $\hat{N}_k$ and, based on the coded
correction value $C_k$, determines the number of coded pulses.
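The indirect coding of the pulse count can be sketched as below. The function names are hypothetical. The text does not say how the decoder resolves the rare case in which two integers map to the same correction value (an expected count that is exactly an integer or a half-integer), so the decoder sketch simply returns every count consistent with the received value.

```python
import math


def expected_pulses(tau_hat, tau_hat_prev, p_prev):
    """Expected number of pulses between two quantized positions."""
    return (tau_hat - tau_hat_prev) / p_prev


def correction_value(n_k, n_hat):
    """Encoder side: map the true pulse count to a correction value 0..3."""
    c = math.floor(2 * abs(n_k - n_hat))
    if c > 3:
        raise ValueError("pulse count too far from the expected value")
    return c


def decode_counts(c_k, n_hat):
    """Decoder side: all non-negative pulse counts consistent with c_k."""
    base = math.floor(n_hat)
    return [n for n in range(max(base - 2, 0), base + 4)
            if math.floor(2 * abs(n - n_hat)) == c_k]
```

For an expected count of 3.3, the true counts 3, 4, 2 and 5 receive the correction values 0, 1, 2 and 3 respectively, so the correction value effectively selects the nearest, second nearest, third nearest or fourth nearest integer.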

If a continue frame follows a start frame, the encoding strategy is different. In
this case there is no $P_{k-1}$ available. For a continue $k$-th frame following a start frame,
the number of pulses, $N_k$, is calculated as

$$ N_k = C_k M_{\hat{\tau}_{k-1}} + C_{k-1} , \tag{5.43} $$

where $C_k$ and $C_{k-1}$ are the coded correction values for the current and for the last
frame, and $M_{\hat{\tau}_{k-1}}$ is the maximum correction value allowed at the quantized start
position $\hat{\tau}_{k-1}$. Given $M_{\hat{\tau}_{k-1}}$ and $N_k$, the correction values are calculated as

$$ C_{k-1} = N_k \bmod M_{\hat{\tau}_{k-1}} , \qquad C_k = \lfloor N_k / M_{\hat{\tau}_{k-1}} \rfloor . \tag{5.44} $$

If an end frame follows a start frame, the coding strategy is the same as when
continue follows start.
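Equations (5.43) and (5.44) split the pulse count into two digits of a mixed-radix number, one carried by each of the two frames. A short sketch, with hypothetical function names:

```python
def split_count(n_k, m):
    """Return (C_{k-1}, C_k) for a pulse count n_k and radix m, as in (5.44)."""
    c_k, c_prev = divmod(n_k, m)
    return c_prev, c_k


def join_count(c_prev, c_k, m):
    """Recover the pulse count from the two correction values, as in (5.43)."""
    return c_k * m + c_prev
```

For example, with a radix of 6 a count of 9 is carried as the pair (3, 1), and joining the pair gives 9 again.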

In every frame, the pitch pulse positions and the number of intermediate pulses are 
specified by an index to a pair of values in a quantization table. Each pair of values 
contains two integers: the first integer specifies the quantized pitch pulse position, 
the second integer is the correction value based on which the estimated number of 
intermediate pitch pulses is modified. 

A quantization table is specified by integer pairs with each pair describing one
allowable pitch pulse position. The set of first elements of the pairs determines the
permitted quantized positions in the current frame. The second element is the number
of correction values assigned to the position described by the pair. The number of the
codewords represented by a quantization table is equal to the sum of the
correction-value numbers.

We code the pitch pulse position information with eight bits, which allows us to
specify 256 codewords. Two quantization tables are used: one for a start frame, and
one for a continue/end frame. In both cases one codeword is reserved for "noise
frame" information. This reduces the number of available codewords to 255.

The quantizing tables used in the implemented coder are given in Table 5.3 and 
Table 5.4. With these tables the beginning of a first pulse is quantized with three- 
sample resolution, a pulse position in a continue frame is quantized with two-sample 
resolution, and the end of a last pulse is quantized with five-sample resolution. The 
coded residual is synchronized with the original at least once per frame within two 
samples in the start frame, one sample in the continue frame, and three samples in 
the end frame. This relaxed synchronization was found to perform very well and the 
semi-synchronized speech was judged to be perceptually equivalent to the original 
(see Section 5.8). 
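The codeword budget of the two tables can be checked mechanically. The sketch below rebuilds Tables 5.3 and 5.4 from their regular structure and maps a codeword index back to a (position, correction value) pair. The helper names and the ordering of codewords within a table are our assumptions; the text only fixes the table contents.

```python
def build_table(segments):
    """Expand (first, last, step, n_corrections) runs into (position, count) pairs."""
    return [(pos, n) for first, last, step, n in segments
            for pos in range(first, last + 1, step)]


# Table 5.3: start frame, three-sample resolution.
START = build_table([(0, 33, 3, 6), (36, 78, 3, 5), (81, 159, 3, 4)])
# Table 5.4: continue part (two-sample resolution) and end part (five-sample).
CONTINUE = build_table([(0, 158, 2, 2)])
END = build_table([(1, 1, 5, 2), (6, 156, 5, 3)])


def codewords(table):
    """Number of codewords a table represents."""
    return sum(n for _, n in table)


def decode_index(table, index):
    """Map a codeword index to (quantized position, correction value)."""
    for pos, n in table:
        if index < n:
            return pos, index
        index -= n
    raise IndexError("codeword index outside the table")
```

Both the start table and the combined continue/end table account for exactly 255 codewords, which together with the reserved noise codeword fills the eight-bit budget.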

5.4.1 Choosing the Pitch Pulse Position to Code 

There may be a number of pitch pulses in a frame and we can choose which pulse 
position to code. One of the strengths of our coding method is that we have this 
choice. Some parameters (pitch pulse shape, pulse length, pulse gain) are interpolated 
between the coded positions; so if we choose the position at which the change in these 
parameters is large we can gain a coding advantage. 

Each pulse position in a frame is first tested to see whether it is "coding valid". A
pulse position at time $\tau$ is "coding valid" if the correction number resulting from
coding this position is smaller than or equal to the maximum correction number
allowed at this position,

$$ C_k(\tau) \le M_\tau . \tag{5.45} $$

In our implementation we code the position which would result in the smallest
maximum change of the coded (and interpolated) pitch pulse lengths with respect to
the lengths of the original pulses. The $k$-th frame boundaries are written as $t_{F_k}$ and
$t_{F_{k+1}}$. The last coded position $\tau_{k-1}$ is equal to $t_i$, and

$$ \tau_{k-1} < t_{i+j} < t_{F_k} \quad \text{for } 0 < j < N_{F_k} , \tag{5.46} $$
Table 5.3 Pitch quantizing table used in the start frame.

(0,6)     (3,6)     (6,6)     (9,6)     (12,6)    (15,6)
(18,6)    (21,6)    (24,6)    (27,6)    (30,6)    (33,6)
(36,5)    (39,5)    (42,5)    (45,5)    (48,5)    (51,5)
(54,5)    (57,5)    (60,5)    (63,5)    (66,5)    (69,5)
(72,5)    (75,5)    (78,5)    (81,4)    (84,4)    (87,4)
(90,4)    (93,4)    (96,4)    (99,4)    (102,4)   (105,4)
(108,4)   (111,4)   (114,4)   (117,4)   (120,4)   (123,4)
(126,4)   (129,4)   (132,4)   (135,4)   (138,4)   (141,4)
(144,4)   (147,4)   (150,4)   (153,4)   (156,4)   (159,4)


Table 5.4 Pitch quantizing table used in the continue/end frame.

Continue part:

(0,2)     (2,2)     (4,2)     (6,2)     (8,2)     (10,2)    (12,2)    (14,2)
(16,2)    (18,2)    (20,2)    (22,2)    (24,2)    (26,2)    (28,2)    (30,2)
(32,2)    (34,2)    (36,2)    (38,2)    (40,2)    (42,2)    (44,2)    (46,2)
(48,2)    (50,2)    (52,2)    (54,2)    (56,2)    (58,2)    (60,2)    (62,2)
(64,2)    (66,2)    (68,2)    (70,2)    (72,2)    (74,2)    (76,2)    (78,2)
(80,2)    (82,2)    (84,2)    (86,2)    (88,2)    (90,2)    (92,2)    (94,2)
(96,2)    (98,2)    (100,2)   (102,2)   (104,2)   (106,2)   (108,2)   (110,2)
(112,2)   (114,2)   (116,2)   (118,2)   (120,2)   (122,2)   (124,2)   (126,2)
(128,2)   (130,2)   (132,2)   (134,2)   (136,2)   (138,2)   (140,2)   (142,2)
(144,2)   (146,2)   (148,2)   (150,2)   (152,2)   (154,2)   (156,2)   (158,2)

End part:

(1,2)     (6,3)     (11,3)    (16,3)    (21,3)    (26,3)    (31,3)    (36,3)
(41,3)    (46,3)    (51,3)    (56,3)    (61,3)    (66,3)    (71,3)    (76,3)
(81,3)    (86,3)    (91,3)    (96,3)    (101,3)   (106,3)   (111,3)   (116,3)
(121,3)   (126,3)   (131,3)   (136,3)   (141,3)   (146,3)   (151,3)   (156,3)

$$ t_{F_k} \le t_{i+j} < t_{F_{k+1}} \quad \text{for } N_{F_k} \le j < N_{F_{k+1}} . \tag{5.47} $$

The selected coded position in this frame, $\tau_k$, is equal to $t_{i+m}$, where

$$ m = \operatorname*{argmin}_{N_{F_k} \le n < N_{F_{k+1}}} \Big( \max_{0 \le j < n} \big| \hat{P}(t_{i+j}) - P(t_{i+j}) \big| \Big) . \tag{5.48} $$

We have found that for a continue frame, the maximum correction value of 1 is
sufficient. For long pitch pulses there are fewer pulses for coding to choose from,
but the correction number of 1 means a very large change in the pulse lengths. For
shorter pulses there are more pulse positions to choose from, and therefore there are
always a few pulses which are "coding valid".

5.4.2 Pitch Pulse Length Interpolation 

In the process of pitch pulse length interpolation, the block of $T_k = \tau_k - \tau_{k-1}$ samples
is segmented into $N_k$ pulses. The original lengths of the pulses are $P(t_{i+j})$, with
$t_i = \tau_{k-1}$ and $0 \le j < N_k$. The new pitch pulse lengths are $\hat{P}(t_{i+j})$, $0 \le j < N_k$.
The segmentation is determined by the applied pitch interpolation technique
(Section 4.1.4).

We implemented the interpolation directly on the pitch pulse lengths $P(t)$ (see
Section 4.1). The linear interpolation described in Section 4.1.4 does not deal with
the constraint of $\hat{P}(t_{i+j})$ being integer. With this constraint

$$ \hat{P}(t_{i+j}) - \hat{P}(t_{i+j-1}) \ne \text{const} . \tag{5.49} $$

We calculate the integer pitch pulse lengths $\hat{P}(t_{i+j})$, $0 \le j < N_k$, so that the variation
in the pitch pulse length differences is minimized. For

$$ c(j) = \hat{P}(t_{i+j}) - \hat{P}(t_{i+j-1}) , \quad 0 < j < N_k , \tag{5.50} $$

we want to minimize ^ ^ 

d c = E (5.51) 

3 = 1 

The algorithm used for segmenting the block of T k samples into N k pulses so that 
d c is minimized is presented in Appendix A. 
5.5 Coding the Gain

The total gain of the residual is coded with 5 bits, twice per frame, in the log domain.
In frames with no pitch pulses, the gain is calculated every 10 ms (80 samples). The
difference with respect to the last quantized gain is encoded. The coded gain is
applied to the second half of the 10 ms gain sub-frame. The gain in the first half is
interpolated from two adjacent coded gains. This ensures that there are no abrupt
changes in the gain envelope.

In frames with pitch pulses, the gain is coded pitch synchronously. For every 
frame, we code the gain of the pulse which ends at the encoded pulse position. The 
other coded gain depends on the number of pulses between the coded positions, 
i.e., the number of interpolated pulses. If there is an odd number of pulses to be 
interpolated, we code the gain of the middle pulse. If there is an even number of 
pulses to be interpolated, we code the average gain of the two middle pulses. If there 
are no pulses to be interpolated, we have two gains corresponding to the same coded 
pitch pulse. In this case the second gain becomes a refinement of the first one so 
that the gain of this pulse is more accurate. In a start frame this may improve the 
adaptability of the coder to large energy changes at voiced onsets. The gains of the 
pulses whose amplitudes were not coded are linearly interpolated from the gains of 
the adjacent pulses. The interpolation is linear in the number of pulses and not in 
time, which means that the interpolation coefficients do not depend on the pitch pulse 
lengths, but only on how many intermediate pulses are between the coded pulses. 
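The choice of the second coded gain and the pulse-count interpolation can be sketched as follows. The function names are hypothetical, and the sketch leaves open whether the values are linear or log-domain gains: it interpolates whatever values it is given.

```python
def second_gain(interpolated_pulse_gains):
    """Gain coded in addition to that of the pulse ending at the coded position.

    Returns None when there are no interpolated pulses; in that case the second
    coded gain refines the gain of the single coded pulse instead.
    """
    n = len(interpolated_pulse_gains)
    if n == 0:
        return None
    mid = n // 2
    if n % 2:                                   # odd count: the middle pulse
        return interpolated_pulse_gains[mid]
    # even count: average of the two middle pulses
    return (interpolated_pulse_gains[mid - 1] + interpolated_pulse_gains[mid]) / 2


def interpolate_gains(g_left, g_right, n_between):
    """Gains of n_between pulses lying between two pulses with coded gains.

    The interpolation is linear in the pulse index, not in time.
    """
    step = (g_right - g_left) / (n_between + 1)
    return [g_left + step * j for j in range(1, n_between + 1)]
```

Because only the pulse index enters `interpolate_gains`, three intermediate pulses between gains 1.0 and 2.0 receive 1.25, 1.5 and 1.75 regardless of their individual lengths.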

Special care is given to frames where the pulses start and end. In a start frame, 
we code the gain of the noise prior to the first pulse and the gain of the first pulse. In 
an end frame, we code the gain of the last pulse and the gain of the noise following 
it. The coding is still differential but a slightly coarser quantization table is used to 
enable a faster build-up of the energy at voiced onsets, and a quicker die-off when the 
voiced region ends. There is no gain interpolation between a gain corresponding to a 
pitch pulse and a gain corresponding to noise. 

In the calculation of the noise gain, a window of at least 10 ms is used. This means
that if the pulses start within 10 ms of the frame beginning or end within 10 ms of the
frame end, the gain of the noise will be calculated using some samples from the last
frame or from the next frame. This was found necessary for obtaining a smooth gain
envelope.
5.6 Coding the Shape of the Pitch Pulses

The shapes of the underlying pitch pulses are coded. One of two types
of coding is used: (i) differential coding with respect to the predicted pitch pulse
shape, or (ii) direct coding of the underlying pitch pulse. For the two coding types,
two different codebooks are used; one bit specifies the coding method selected. The
coding is carried out in the frequency domain. One pulse per frame is coded, and
the intermediate pulses are interpolated from the coded pulses.

Predicting the Underlying Pitch Pulse 

At present, the predicted underlying pitch pulse is simply the last coded pulse. Re- 
liable prediction of the current underlying pitch pulse from the past coded pulses is 
one of the suggested topics for future work. 

Estimation of the Underlying Pulses 

The extracted pitch pulses are normalized and the underlying pitch pulses are
estimated with the algorithm described in Section 3.3.4. The weighted average algorithm
is used with the error weight $\omega = 0.7$. The algorithm is initialized with the underlying
pulses equal to the extracted pulses. In a start frame the previous underlying vector
$v_0$ is unknown; in a continue and end frame the previous underlying pulse is set
to the pulse which was coded in the last frame. The pulses used in the estimation
procedure are extracted from a block of the LP residual starting at the pulse position
coded in the last frame, and ending at the end of the current frame.

Coding Pitch Pulses 

First, the underlying pulses are transformed into the frequency domain with a 
fixed-length DFT using an FFT algorithm. Then, a linear fit is applied to the set 
of underlying pulses so that the intermediate pulses are reconstructed (via linear 
interpolation) with minimum error. The weighted mean square error linear fit is 
described in Appendix B. The pulse which ends at the encoded pitch pulse position 
is coded. In the coding of the pitch pulses we use one of two codebooks. The first 
codebook, $\mathrm{CB}_p$, contains sample spectra of underlying pitch pulses of 
various speakers. The second codebook, $\mathrm{CB}_d$, contains differences between 
pulse spectra. 


Training of the Codebooks 

In training the codebook of pitch pulse spectra $\mathrm{CB}_p$, we used pitch pulses of male 
speakers with pulse lengths larger than 5 ms. A spectrum of an underlying pulse 
was included in the selection set if its normalized correlation with respect to one of 
its neighbours was larger than 0.8. The selection set was first used as a training set 
for the generalized Lloyd algorithm (GLA) to compute an initial set of codebook 
vectors. The selection set was then searched for the pulses which maximize the 
correlation with respect to the initial codebook vectors, and the codebook 
$\mathrm{CB}_p$ was populated with these pulses. This was done so that 
$\mathrm{CB}_p$ would contain true spectra of estimated underlying pulses and not 
averages of them. In particular we 
wanted to avoid averaging underlying pulses coming from different voiced segments. 
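
The last training step, replacing each GLA centroid by an actual spectrum from the selection set, can be illustrated with a minimal sketch. It is not the implementation used in the coder (which was written in C). The function names are hypothetical, and the definition of the normalized correlation (real part of the inner product divided by the product of the norms) is an assumption, since the text does not state it.

```python
import math

def normalized_correlation(x, y):
    """Normalized correlation between two complex spectra (assumed definition)."""
    num = sum((a * b.conjugate()).real for a, b in zip(x, y))
    den = math.sqrt(sum(abs(a) ** 2 for a in x) * sum(abs(b) ** 2 for b in y))
    return num / den if den > 0.0 else 0.0

def snap_to_training_vectors(centroids, selection_set):
    """Replace every centroid by the selection-set spectrum most correlated with it.

    The returned codebook holds only spectra that occur in the selection set,
    never an average of several of them.
    """
    codebook = []
    for c in centroids:
        best = max(selection_set, key=lambda s: normalized_correlation(s, c))
        codebook.append(list(best))
    return codebook

# Toy example: two centroids, three candidate spectra of dimension 2.
selection_set = [[1 + 0j, 0j], [0j, 1 + 0j], [1 + 0j, 1 + 0j]]
centroids = [[0.9 + 0j, 0.2 + 0j], [0.1 + 0j, 0.8 + 0j]]
codebook = snap_to_training_vectors(centroids, selection_set)
```

In this toy example each centroid is replaced by the selection-set vector it resembles most, so the codebook contains two of the three original spectra unchanged.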

The second codebook $\mathrm{CB}_d$ contains differences between spectra. The difference 
between the spectra of two underlying pulses was included in the training set if 
the corresponding extracted pulses were less than 20 ms apart and the normalized 
correlation between them was larger than 0.5. Note that the criteria were applied to 
the extracted pulses but the difference was taken between the spectra of the underlying 
pulses. The $\mathrm{CB}_d$ codebook was also trained with the GLA algorithm. 


Searching the Codebooks 


In the coding procedure, the codebook $\mathrm{CB}_p$ is first searched for the entry which 
minimizes the weighted mean square error with respect to the target spectrum. The 
weighted mean square error is specified by two fixed weights. The weight $w_a(f)$ 
emphasizes the importance of the frequencies around 1 kHz and is given by 

$$
w_a(f) =
\begin{cases}
0.1 & \text{for } 0 \le |f| < 300 \\
0.5 & \text{for } 300 \le |f| < 600 \\
1.0 & \text{for } 600 \le |f| < 1200 \\
0.4 & \text{for } 1200 \le |f| < 2400 \\
0.1 & \text{otherwise,}
\end{cases}
\tag{5.52}
$$

with $f$ in Hz. 
The weight $w_{ri}(f)$ is used to deemphasize the phase of higher frequencies and is 
specified by 

$$
w_{ri}(f) =
\begin{cases}
0.4 & \text{for } 0 \le |f| < 300 \\
0.6 & \text{for } 300 \le |f| < 600 \\
0.4 & \text{for } 600 \le |f| < 1200 \\
0.2 & \text{for } 1200 \le |f| < 2400 \\
0.0 & \text{otherwise.}
\end{cases}
\tag{5.53}
$$


The weighted error between two spectra is calculated as a sum of the weighted errors 
between the real parts, the imaginary parts and the amplitudes of the spectra, 

$$ E_{ri}(f) = \left(X_r(f) - Y_r(f)\right)^2 + \left(X_i(f) - Y_i(f)\right)^2 , \tag{5.54} $$

$$ E_a(f) = \left(X_a(f) - Y_a(f)\right)^2 , \tag{5.55} $$

$$ E_t = \sum_f w_a(f) \left[ w_{ri}(f) E_{ri}(f) + \left(1 - w_{ri}(f)\right) E_a(f) \right] , \tag{5.56} $$

where $X_r$, $X_i$, $X_a$, $Y_r$, $Y_i$ and $Y_a$ denote the real part, the imaginary part and the 
amplitude of the spectra $X$ and $Y$ respectively. We can also write 

$$ E_t = \sum_f \left[ w'(f) E_{ri}(f) + w''(f) E_a(f) \right] , \tag{5.57} $$

where 

$$ w'(f) = w_a(f)\, w_{ri}(f) \quad \text{and} \quad w''(f) = w_a(f) - w'(f) . \tag{5.58} $$
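
The weighted error of (5.52)-(5.56) can be sketched in a few lines. This is an illustration only, not the code of the implemented coder. The function names are hypothetical, the frequency of each DFT bin is assumed to be supplied by the caller in Hz, and the band edges are assumed to be half-open intervals (lower edge included).

```python
def w_a(f_hz):
    """Amplitude weight of (5.52): emphasizes frequencies around 1 kHz."""
    f = abs(f_hz)
    if f < 300.0:
        return 0.1
    if f < 600.0:
        return 0.5
    if f < 1200.0:
        return 1.0
    if f < 2400.0:
        return 0.4
    return 0.1

def w_ri(f_hz):
    """Real/imaginary weight of (5.53): de-emphasizes the phase at high frequencies."""
    f = abs(f_hz)
    if f < 300.0:
        return 0.4
    if f < 600.0:
        return 0.6
    if f < 1200.0:
        return 0.4
    if f < 2400.0:
        return 0.2
    return 0.0

def weighted_error(X, Y, freqs_hz):
    """Weighted error E_t of (5.56) between two complex spectra X and Y."""
    total = 0.0
    for x, y, f in zip(X, Y, freqs_hz):
        e_ri = (x.real - y.real) ** 2 + (x.imag - y.imag) ** 2  # (5.54)
        e_a = (abs(x) - abs(y)) ** 2                            # (5.55)
        total += w_a(f) * (w_ri(f) * e_ri + (1.0 - w_ri(f)) * e_a)
    return total
```

Two bins with equal amplitude but different phase show the effect of $w_{ri}(f)$: at 1 kHz the phase difference contributes to the error, whereas above 2400 Hz, where $w_{ri}(f) = 0$, only the amplitude difference counts.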

The codebook $\mathrm{CB}_d$ is searched for the entry which minimizes the weighted mean 
square error with respect to the difference between the target spectrum and the last 
coded spectrum. 

The codebook $\mathrm{CB}_p$ must be able to quickly update the pitch pulse at the start 
of a voiced segment or within the voiced segment when the pitch pulse waveshape 
changes abruptly (which we have observed). Within regions where the pitch pulse 
changes slowly, the update of the pulse spectrum should be supplied by the codebook 
$\mathrm{CB}_d$. We observed excessive periodicity when the same entry of codebook 
$\mathrm{CB}_p$ was repeatedly chosen as the coded spectrum. To eliminate that problem, 
the codebook which is used to code the current pulse spectrum is chosen as follows. 
If the minimum weighted error of $\mathrm{CB}_d$ is smaller than the minimum weighted 
error of $\mathrm{CB}_p$, the codebook $\mathrm{CB}_d$ is used. In this case the coded spectrum is the sum 
of the last spectrum and the selected entry of the codebook $\mathrm{CB}_d$. If the minimum 
weighted error of $\mathrm{CB}_p$ is smaller, we calculate the weighted error between the selected 
entry of $\mathrm{CB}_p$ and the last coded spectrum. The codebook $\mathrm{CB}_p$ is used if this error is 
larger than the error between the spectrum coded with $\mathrm{CB}_d$ and the last spectrum. 
To make the above more clear, we write the last coded spectrum as $\hat{X}_{i-1}$, the target 
spectrum as $X_i$, the current coded spectrum as $\hat{X}_i$, the selected entry of the 
codebook $\mathrm{CB}_p$ as $Y_p$, the selected entry of the codebook $\mathrm{CB}_d$ as $Y_d$ and the weighted 
error between spectra $X$ and $Y$ as $E_w(X, Y)$. We choose the current coded spectrum 
$\hat{X}_i = Y_p$ if 

$$ E_w(Y_p, X_i) < E_w(\hat{X}_{i-1} + Y_d, X_i) \tag{5.59} $$

and 

$$ E_w(Y_p, \hat{X}_{i-1}) > E_w(\hat{X}_{i-1} + Y_d, \hat{X}_{i-1}) . \tag{5.60} $$

Otherwise the coded spectrum is chosen as $\hat{X}_i = \hat{X}_{i-1} + Y_d$. 
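
The decision rule of (5.59) and (5.60) can be sketched as follows. This is a minimal illustration under stated assumptions, not the coder's own code: the function names are hypothetical, the best entries of the two codebooks are assumed to have been found already, and a plain squared error stands in for the weighted error $E_w$.

```python
def add_spectra(a, b):
    """Element-wise sum of two spectra."""
    return [p + q for p, q in zip(a, b)]

def squared_error(x, y):
    """Stand-in for the weighted error E_w (assumption: unweighted squared error)."""
    return sum(abs(a - b) ** 2 for a, b in zip(x, y))

def choose_coded_spectrum(target, last_coded, y_p, y_d, error=squared_error):
    """Return (codebook name, coded spectrum) following (5.59)-(5.60).

    y_p: best entry of the pulse-spectrum codebook
    y_d: best entry of the difference codebook
    """
    diff_coded = add_spectra(last_coded, y_d)
    # (5.59): the direct entry matches the target better than the differential one.
    p_closer_to_target = error(y_p, target) < error(diff_coded, target)
    # (5.60): the direct entry differs from the last coded spectrum by more than
    # the differentially coded spectrum does.
    p_moves_further = error(y_p, last_coded) > error(diff_coded, last_coded)
    if p_closer_to_target and p_moves_further:
        return "CB_p", list(y_p)
    return "CB_d", diff_coded
```

With an abrupt change of the target both conditions hold and the direct entry is chosen; with a slowly changing target the differential update is used.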

Pitch Pulse Interpolation 

The pitch pulses are interpolated in the spectral domain. We use linear interpolation 
on complex spectra. The interpolation is linear in the number of pulses and not 
in time, i.e., a fixed number of pulses will have the same interpolation coefficients 
regardless of the interval between pulse positions. 
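
As an illustration of interpolation that is linear in the pulse index, the sketch below interpolates between the previously coded spectrum and the currently coded one. It is a minimal sketch, not the coder's implementation. The function name is hypothetical, and `n_pulses` is assumed to count the pulses that follow the previously coded pulse, up to and including the currently coded pulse.

```python
def interpolate_spectra(prev_spec, curr_spec, n_pulses):
    """Linearly interpolate complex spectra over n_pulses pulses.

    The coefficient of pulse k (k = 1..n_pulses) is k / n_pulses, so it
    depends only on the pulse index and not on the pulse positions in time.
    The last returned spectrum equals curr_spec.
    """
    out = []
    for k in range(1, n_pulses + 1):
        a = k / n_pulses
        out.append([(1.0 - a) * p + a * c for p, c in zip(prev_spec, curr_spec)])
    return out

# Four pulses between two coded spectra of dimension 2.
spectra = interpolate_spectra([0j, 2 + 2j], [4 + 0j, 0j], 4)
```

The second of the four pulses lies halfway between the two coded spectra, whatever the time interval between the coded pulse positions.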

After the interpolation and the inverse transform, the time-domain voiced LP ex- 
citation is formed. The shape-interpolated pulses are placed at the pulse positions 
determined by the pulse-length interpolation. 

5.7 Coding the Noise Component 

The noise component of the LP excitation is coded with a CELP-like procedure via 
analysis-by-synthesis with the error calculated in the perceptually-weighted speech 
domain. 

The perceptually-weighted speech in the sub-frame $l$ is created by filtering the 
LP residual with the perceptual weighting filter 

$$ H(z) = \frac{1}{\hat{A}(z)} \, \frac{A(\gamma_1 z)}{A(\gamma_2 z)} , \tag{5.61} $$

where $A(z)$ and $\hat{A}(z)$ are formed from, respectively, the LP coefficients and the quantized LP 
coefficients obtained as specified in Section 5.2. The parameters $\gamma_1$ and $\gamma_2$ regulate 
the strength of the perceptual weighting. We use $\gamma_1$ and $\gamma_2$ fixed at 1.0 and 0.8. The 
impulse response of the filter $H(z)$ is written as $h(n)$. 
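
The impulse response $h(n)$ can be computed by passing a unit impulse through the cascade of the weighting filter and the synthesis filter. The sketch below is an illustration only, with hypothetical function names. It assumes the usual bandwidth-expansion convention in which the $k$-th predictor coefficient is multiplied by $\gamma^k$, and it assumes coefficient vectors of the form `[1, a_1, ..., a_M]`; neither convention is confirmed by the text.

```python
def expand(a, gamma):
    """Bandwidth expansion: scale the k-th coefficient by gamma**k (assumed convention)."""
    return [c * gamma ** k for k, c in enumerate(a)]

def iir_filter(b, a, x):
    """Direct-form filter: a[0]*y[n] = sum_k b[k]*x[n-k] - sum_{k>=1} a[k]*y[n-k]."""
    y = []
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc / a[0])
    return y

def weighting_impulse_response(a, a_hat, gamma1, gamma2, n):
    """First n samples of h(n) for H(z) = (1 / A_hat) * A(gamma1) / A(gamma2)."""
    impulse = [1.0] + [0.0] * (n - 1)
    weighted = iir_filter(expand(a, gamma1), expand(a, gamma2), impulse)
    return iir_filter([1.0], a_hat, weighted)

# First-order example with unquantized coefficients (a_hat equal to a).
h = weighting_impulse_response([1.0, -0.9], [1.0, -0.9], 1.0, 0.8, 6)
```

In this first-order example with $\gamma_1 = 1.0$ and identical quantized and unquantized coefficients, the numerator cancels the synthesis filter and only the expanded denominator remains, so $h(n) = 0.72^n$.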

It is useful to decompose the output of the filter into two components: one which 
depends on the past inputs and the other which depends on the samples of the current 
sub-frame. When the input of the filter for the current sub-frame is a zero vector, the 
output of the filter is the zero input response. The zero input response depends on 
the past inputs to the filter†. When the memory of a filter is set to zero, the output 
of the filter is the zero-state response. The total output of the filter in the current 
sub-frame is a superposition of the zero input and the zero-state responses. The filter 
output therefore depends, at a given time, on (i) the filter coefficients, (ii) the past 
inputs to the filter, and (iii) the current input to the filter. We write the reconstructed 
weighted speech as 

$$ \hat{s}_w^{(l)} = \mathcal{H}_{(x)}\!\left(x^{(l)}\right) . \tag{5.62} $$

This is the output of the filter with (i) the coefficients specified by $H(z)$, (ii) the 
past inputs equal to the past samples of the LP excitation $x$, and (iii) the current 
input equal to the vector of samples of the current excitation sub-frame $x^{(l)}$. With 
the symbols $\mathcal{H}_{ZI}$ and $\mathcal{H}_{ZS}$ denoting, respectively, the zero input and the zero-state 
response of the filter $H$, we have 

$$ \mathcal{H}_{(x)}\!\left(x^{(l)}\right) = \mathcal{H}_{ZI}^{(x)} + \mathcal{H}_{ZS}\!\left(x^{(l)}\right) . \tag{5.63} $$

The zero-state response of a filter is equal to the convolution of the impulse 
response of the filter with the input of the filter. In a matrix notation this convolution 
can be written as 

$$ \mathcal{H}_{ZS}\!\left(x^{(l)}\right) = \mathbf{H} x^{(l)} , \tag{5.64} $$

where $\mathbf{H}$ is the impulse response matrix specified as 

$$
\mathbf{H} =
\begin{pmatrix}
h(0) & h(1) & h(2) & \cdots & h(N-1) \\
0 & h(0) & h(1) & \cdots & h(N-2) \\
0 & 0 & h(0) & \cdots & h(N-3) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & h(0)
\end{pmatrix} .
\tag{5.65}
$$

†In perceptual weighting the filter coefficients change from frame to frame (and possibly from 
sub-frame to sub-frame). As a result the zero input response also depends on the past coefficients 
of the filter. Our notation does not show this dependency. 

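The reconstructed, weighted speech of the sub-frame $l$ is given by 

\begin{align}
\hat{s}_w^{(l)} &= \mathcal{H}_{(x)}\!\left(x^{(l)}\right) \tag{5.66} \\
&= \mathcal{H}_{ZI}^{(x)} + \mathcal{H}_{ZS}\!\left(x^{(l)}\right) . \tag{5.67}
\end{align}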

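With $r$ denoting the LP residual, the perceptually-weighted speech of the 
sub-frame $l$ is given by 

\begin{align}
s_w^{(l)} &= \mathcal{H}_{(r)}\!\left(r^{(l)}\right) \tag{5.68} \\
&= \mathcal{H}_{ZI}^{(r)} + \mathcal{H}_{ZS}\!\left(r^{(l)}\right) . \tag{5.69}
\end{align}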

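The error between the perceptually weighted speech and the reconstructed, weighted 
speech is 

\begin{align}
s_w^{(l)} - \hat{s}_w^{(l)} &= \mathcal{H}_{ZI}^{(r)} - \mathcal{H}_{ZI}^{(x)} + \mathcal{H}_{ZS}\!\left(r^{(l)}\right) - \mathcal{H}_{ZS}\!\left(x^{(l)}\right) \tag{5.70} \\
&= \mathcal{H}_{ZI}^{(r-x)} + \mathcal{H}_{ZS}\!\left(r^{(l)}\right) - \mathcal{H}_{ZS}\!\left(x^{(l)}\right) \tag{5.71} \\
&= \mathcal{H}_{(r-x)}\!\left(r^{(l)}\right) - \mathcal{H}_{ZS}\!\left(x^{(l)}\right) . \tag{5.72}
\end{align}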

The notation $\mathcal{H}_{(r-x)}\!\left(r^{(l)}\right)$ represents the output of a filter with (i) the coefficients 
specified by $H(z)$, (ii) the past inputs equal to the difference between the past LP 
residual $r$ and the past coded excitation $x$, and (iii) the current input equal to the vector 
$r^{(l)}$. We want to minimize 

$$ \left\| s_w^{(l)} - \hat{s}_w^{(l)} \right\|^2 , \tag{5.73} $$

which is equivalent to finding $x^{(l)}$ such that 

$$ \left\| \mathcal{H}_{(r-x)}\!\left(r^{(l)}\right) - \mathcal{H}_{ZS}\!\left(x^{(l)}\right) \right\|^2 \tag{5.74} $$

is minimized. 

The Noise Codebooks 

The noise is coded on the basis of 40-sample long sub-frames and so the dimension 
of the noise codebook vectors is 40. The codebook vectors were populated with 
random independent Gaussian numbers, one value per five vector elements. The noise 
codebook entries are therefore sparse vectors with eight non-zero elements each. The 
positions of the non-zero elements are chosen randomly (uniform random selection of 
one element out of each group of five). The magnitude of the non-zero elements is 
bounded between 0.5 and 1.5. 

For frames with no pitch pulses, the noise is coded with 8 bits per sub-frame and so 
the size of the noise codebook for those frames is 256. For the frames with pitch pulses, 
the noise is coded with only 4 bits per sub-frame; so the size of the noise codebook 
is only 16. With such a small codebook, it is possible that the same codebook entry 
might be chosen in consecutive sub-frames making the noise contribution periodic. 
Therefore, in coding the pitch pulse noise, we cycle over four codebooks of size 16 
(we use 64 vectors of the 256-vector codebook). 
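
The codebook structure described above can be sketched as follows. This is a minimal illustration with hypothetical names, not the procedure actually used to populate the codebooks. Two details are assumptions: the magnitude bound is enforced by redrawing Gaussian values until they fall between 0.5 and 1.5, and the four 16-vector codebooks for voiced frames are taken to be the first 64 vectors of the 256-vector codebook.

```python
import random

def make_sparse_codebook(size, dim=40, stride=5, seed=0):
    """Sparse Gaussian codebook with one non-zero element per group of `stride` samples."""
    rng = random.Random(seed)
    codebook = []
    for _ in range(size):
        vec = [0.0] * dim
        for start in range(0, dim, stride):
            pos = start + rng.randrange(stride)  # uniform choice of one position out of five
            g = rng.gauss(0.0, 1.0)
            while not 0.5 <= abs(g) <= 1.5:      # assumption: redraw until within bounds
                g = rng.gauss(0.0, 1.0)
            vec[pos] = g
        codebook.append(vec)
    return codebook

def voiced_subcodebook(codebook, subframe_index, n_books=4, book_size=16):
    """Cycle over n_books sub-codebooks so consecutive sub-frames use different entries."""
    k = subframe_index % n_books
    return codebook[k * book_size:(k + 1) * book_size]

codebook = make_sparse_codebook(256)           # 8 bits per sub-frame (no pitch pulses)
sub = voiced_subcodebook(codebook, 5)          # 4 bits per sub-frame (with pitch pulses)
```

Because the sub-codebook changes with the sub-frame index, the same 4-bit index in consecutive sub-frames selects different vectors, which avoids a periodic noise contribution.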

Coding 

The noise component is added to the voiced part of the excitation. The voiced 
contribution is created, as described in the previous sections, by placing the 
pulse-shape-interpolated underlying pulses at pulse-length-interpolated positions. We have 

$$ x^{(l)} = x_p^{(l)} + x_n^{(l)} , \tag{5.75} $$

where $x_p^{(l)}$ and $x_n^{(l)}$ denote, respectively, the pitch pulse and the noise contributions 
to the excitation $x^{(l)}$. Now 

$$ \mathcal{H}_{ZS}\!\left(x^{(l)}\right) = \mathcal{H}_{ZS}\!\left(x_p^{(l)}\right) + \mathcal{H}_{ZS}\!\left(x_n^{(l)}\right) . \tag{5.76} $$

Based on (5.74) and (5.76), the target vector for the noise contribution to the 
excitation of the sub-frame $l$ is 

$$ n_t^{(l)} = \mathcal{H}_{(r-x)}\!\left(r^{(l)}\right) - \mathcal{H}_{ZS}\!\left(x_p^{(l)}\right) . \tag{5.77} $$
The target vector is specified in the perceptually weighted speech domain and 
corresponds to the vector $\mathcal{H}_{ZS}\!\left(x_n^{(l)}\right)$. We calculate $n_t^{(l)}$ and then search the noise 
codebook filtered with $\mathcal{H}_{ZS}(\cdot)$ for the best mean-square-error match. Filtering a vector 
with $\mathcal{H}_{ZS}(\cdot)$ is equivalent to multiplying the vector with the impulse response matrix 
$\mathbf{H}$ specified in (5.65). 

The relative ratio between the noise contribution and the pitch pulse contribution 
($r_{n/p}$) is coded differentially with 3 bits. The update rates of the gain ratio and of the 
total gain are the same (every 10 ms). 
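
The target computation of (5.77) and the search of the filtered codebook can be sketched as follows. This is an illustration under simplifying assumptions, not the coder's implementation: the function names are hypothetical, the codebook vectors are assumed to be already scaled (the separately coded gain ratio is ignored), and the vector $\mathcal{H}_{(r-x)}(r^{(l)})$ is assumed to be supplied by the caller.

```python
def zero_state_response(h, x):
    """Convolve x with the impulse response h, truncated to len(x) samples."""
    return [sum(h[n - k] * x[k] for k in range(n + 1) if n - k < len(h))
            for n in range(len(x))]

def noise_target(weighted_signal, pulse_excitation, h):
    """Target of (5.77): the weighted-domain signal minus the filtered pulse contribution."""
    y_p = zero_state_response(h, pulse_excitation)
    return [e - v for e, v in zip(weighted_signal, y_p)]

def search_noise_codebook(target, codebook, h):
    """Return (index, error) of the filtered codebook vector closest to the target."""
    best_index, best_err = -1, float("inf")
    for i, c in enumerate(codebook):
        y = zero_state_response(h, c)
        err = sum((t - v) ** 2 for t, v in zip(target, y))
        if err < best_err:
            best_index, best_err = i, err
    return best_index, best_err

# Toy example: 3-sample sub-frame, 2-tap impulse response.
h = [1.0, 0.5]
toy_codebook = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
target = noise_target([1.0, 1.5, 0.5], [1.0, 0.0, 0.0], h)
index, err = search_noise_codebook(target, toy_codebook, h)
```

In the toy example the pulse contribution accounts for the first sample and part of the second; the remaining target is matched exactly by the second codebook vector after filtering.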

5.8 Testing and Remarks 

In the PPE coder only approximate time synchrony with the original signal is main- 
tained. The relaxed synchrony renders the SNR measurement between the original 
and the coded speech inappropriate as a fidelity criterion. Subjective testing is nec- 
essary to evaluate the performance of the coder. 

The quality of the coded speech has been assessed in informal listening tests. The 
testing was performed through A-B comparison tests† with respect to the original 
recording and with respect to speech coded with the G.729 coder‡. Since we did not 
use a post-filter in the PPE coder, to make the comparison fair, we used G.729 without 
post-filtering. The tests were performed on sentences from outside the database used 
for training the LSF and the pitch pulse shape codebooks. The testing was done over 
headphones and the listeners were considered untrained. 

The tests were repeated as the coding of each parameter was added in turn. After coding 
the LP parameters, the pitch pulse positions and the gain (the pitch pulse shapes 
uncoded) the reconstructed speech was judged perceptually equivalent to the original, 
i.e., the listeners could not tell the difference between the original and reconstructed 
signals. When the coding of the noise component was added, audible differences 
between the original and the reconstructed speech appeared. We then switched to 
A-B comparisons with respect to the G.729 coder. At this point (coded: the LP 
parameters, the pitch pulse positions, the gain, the noise component), the PPE coded 
speech was consistently judged better than speech coded with the G.729 coder. 

†In A-B testing the listener is asked to identify the better quality recording out of two 
consecutively played audio files presented in random order. 

‡The G.729 coder is the ITU-T toll-quality 8 kb/s standard. The MOS score of the G.729 coder 
has been assessed at near 4.0. 

To further verify our coding of the noise contribution, we used the G.729 Algebraic 
Structure Codebook in the place of our noise codebook. The reconstructed speech 
sounded slightly different but was neither better nor worse than the one coded with 
our noise codebook. The PPE reconstructed speech was still consistently preferred 
to the one coded with G.729. The Algebraic Structure Codebook has the advantage 
of very low computational complexity and we were pleased to confirm that it could 
be used in the PPE coder. 

G.729 took the upper hand once the pitch pulse shapes were coded. The distortion 
introduced by the PPE coder was often difficult to identify, particularly when the audio 
files were played over loudspeakers. Over headphones, however, speech coded with 
the G.729 coder was, most of the time, judged better. 

Unfortunately, time did not permit more extensive or more formal testing of the 
implemented coder. From the tests performed, however, it is evident that the coding 
of the underlying pitch pulses requires more attention. A strong indication of 
insufficient pitch pulse codebook training is the variability introduced to the coder 
performance when coding of the pitch pulses is added: the quality of some coded 
sentences was clearly better than that of other sentences, all of them from outside the 
training set. We did not have access to a sufficiently large database to train the pitch pulse 
codebooks properly. Ideally, we would like every pulse in the pulse-shape codebook to 
be extracted from a different voiced region. Yet a one-minute-long speech utterance 
consists of, typically, only up to six voiced regions. Better coding of the 
underlying-pitch-pulse shapes is the main area for future research. 

The Delay and Complexity of the Coder 

The algorithmic delay of the implemented coder is dictated by (i) the pitch extraction 
algorithm and (ii) the pitch pulse position coding. In the described implementation the 
look-ahead of the pitch pulse extraction algorithm is equal to one frame (20 ms) (see 
page 90). The pitch pulse position coding introduces an extra coding delay of up to 
one frame, depending on which pitch pulse position was chosen for coding in the last 
frame. When the coded pulse position in the last frame is near the frame beginning, 
the coding delay introduced by the pulse position coding could be up to 20 ms. The 
total algorithmic delay of the implemented coder is up to 60 ms. 

The coder has been implemented in the C language in the UNIX environment and, 
to facilitate the testing of various aspects of the coding procedure, it is distributed 
over a number of programs. Particularly in the early development stages we needed 
full control over every parameter involved in the coding algorithm. Therefore, at 
present, the coder is not optimized for computational performance. We 
estimate, however, that the coder can be implemented in real time on a single 
fixed-point DSP chip performing less than 40 MIPS (a typical high-performance DSP chip), 
based on the fact that coders of similar computational complexity have been shown 
to be implementable under such restrictions (see for example Kleijn et al. 1996). 


Chapter 6 

Final Remarks, Contributions and 
Future Work 


6.1 Summary of Our Work 

In Chapter 1 the problem of speech coding was introduced. We provided an overview 
of a number of existing coders which are considered the state-of-the-art for speech 
compression at low bit rates. The scope and the objectives of our research were 
outlined; our goal was to represent narrowband-limited speech (200 Hz-3.4 kHz) 
with a bit stream of about 4 kb/s with reconstructed speech close to or equivalent to 
toll quality. 

In Chapter 2 the principles of linear prediction coding were presented and the 
problem of modelling the LP excitation was examined. It was pointed out that, in 
linear predictive coding, the poor modelling of the voiced LP excitation is, to a large 
extent, responsible for degradation of the quality of the reconstructed speech. 

The analysis-by-synthesis technique and CELP coding were described. We ana- 
lyzed a number of improvements and modifications which have been proposed to bet- 
ter the representation of the LP excitation in the context of the CELP coder. Those 
included: harmonic noise weighting, constrained excitation, pitch synchronous inno- 
vation, comb filtering, and pitch sharpening. The paradigm of generalized analysis- 
by-synthesis coding was introduced. 

We classified the LP coders according to the analysis block length and analysis rate 
used. The CELP coder is an example of a coder with fixed-rate, fixed-block-length 
analysis. The analysis performed in a WI coder is also fixed-rate but the analysis 
block lengths are pitch synchronous. 

Also in Chapter 2, the PPE coding technique was outlined. In the PPE model the 
LP excitation is represented as a series of evolving pitch pulses obscured by noise. 
The pitch pulse-related analysis is pitch synchronous in the block lengths and the 
analysis rate. The coding of the noise contribution is based on generalized analysis- 
by-synthesis coding. 

In Chapter 3 we formulated the principles of the PPE model in more detail. The 
LP excitation is modelled as a series of underlying pulses buried in noise; the advan- 
tage of this approach is that we can model individually every pitch pulse and effec- 
tively separate the periodic (voiced) component and the noise (unvoiced) component 
of the excitation. The periodic part is identified via estimation of underlying pitch 
pulses based on noisy pulses extracted from the LP residual. The noise contribution 
is coded using a generalized analysis-by-synthesis procedure. 

In this chapter we set demanding goals for the pitch pulse extraction algorithm. 
The algorithm should identify individual pitch pulses in a way that the error between 
the underlying pitch pulses and the noisy pulses is minimized. At this point the esti- 
mation of the underlying pulses was simplified — we chose a pulse from a set of model 
pulses. With similar consecutive model pulses, the effect of the pulse segmentation is 
that the extracted noisy pulses are properly aligned. 

A number of estimation methods to reliably identify the underlying pitch pulses 
from the LP residual noisy observation were investigated. The described methods 
were: 


- linear filtering with fixed coefficients, 

- maximum ratio combining (linear filtering with adaptive coefficients), 

- error minimization between an underlying pulse and a number of noisy pulses 
(also linear filtering with adaptive coefficients), 

- an algorithm which minimizes the sum of weighted errors (i) between the con- 
secutive underlying pulses, (ii) between the underlying and the noisy pulses. 

It was concluded that the last method, the error minimization algorithm, is the most 
effective. In the subsequent estimation of the underlying pulses we used the 
weighted-average version of the algorithm (the weighted-average algorithm is computationally 
less expensive than the SVD algorithm and both algorithms are equally effective for 
the value of the error weight used). 

In Chapter 4 the pitch interpolation methods used in other coders were investi- 
gated. We argued against time-warping and supported our view with examples based 
on the LP residual. Separate interpolations of the pitch pulse length and the pitch 
pulse waveshapes were proposed. The separate interpolation improves the control 
over the evolution of the pitch pulse characteristics. We formulated the pitch pulse- 
length interpolation in terms of (i) the pulse length, (ii) the fundamental frequency, 
and (iii) the instantaneous pitch period. Finally, the spectral interpolation was ex- 
plained based on the interpolation used in WI and TFI. The PPE coder uses spectral 
interpolation to interpolate the pitch pulse waveshapes. 

In Chapter 5 an implementation of a PPE coder was presented. In the imple- 
mentation, the practical aspects of some of the key units of the PPE coder were 
developed. In particular, we designed a robust pitch pulse extractor which satisfies 
the tough requirements set by the model. A very low-rate pitch pulse position encod- 
ing scheme was devised in which the reconstructed signal maintains only a rough time 
synchrony with the original signal. It was verified that the relaxed time synchrony 
provides reconstructed speech equivalent to the original. 

Also in this chapter, the results of informal listening tests of the implemented 
PPE coder were discussed. The coder was tested with respect to the original record- 
ing and the speech coded with G.729 (toll quality speech at 8 kb/s). The coding of 
LP coefficients, the pitch pulse positions and the gain provided speech quality equiv- 
alent to the original. After coding of the noise component (the pitch pulses uncoded), 
the PPE reconstructed signal was still better than speech coded with G.729. Only 
after coding the shapes of the pitch pulses (speech fully coded), did the quality of 
the reconstructed signal seem to be, in some cases, below the speech processed with 
G.729. We concluded that the codebooks and the weighted error criterion used for 
coding the pitch pulses still need more attention. 

6.2 PPE Coding Versus WI Coding 

The PPE model is akin in many ways to WI coding. As a part of the final remarks we 
would like to compare the two coding techniques. Both methods (i) use LP analysis, 
(ii) extract individual pitch cycles, (iii) separate the noisy part and the slowly evolving 
part of the pitch waveform, (iv) encode the pitch information and the pitch cycle 
waveshape infrequently, and (v) use interpolation to reconstruct intermediate pulse 
waveforms. The main differences between the implemented PPE coder and a WI 
coder can be summarized as follows. 

o PPE: The pitch pulses are identified based on the error between the model 
pulses and the LP residual. 

WI: The pitch period is estimated from the LP residual based on the autocor- 
relation function calculated from the residual. 

o PPE: The pitch pulses are extracted pitch synchronously, one pitch pulse per 
pitch period. The rate of extraction depends on the pitch pulse lengths. 

WI: The characteristic waveforms are extracted with fixed rate. The lengths 
of the waveforms are based on the estimated pitch period but the extraction is 
pitch asynchronous. 

o PPE: Every sample is used in one and only one pitch pulse. The case in which 
the pulses overlap and a new pulse is superimposed on the tail of the old pulse 
may be considered but so far has not been used. 

WI: With higher extraction rates the characteristic waveforms overlap, with 
lower extraction rates some of the residual samples are not included in any of 
the waveforms. 

o PPE: The position of a pitch pulse and the number of pitch pulses between the 
pulse positions are coded and transmitted. 

WI: The pitch period is coded and transmitted. 

o PPE: The frequency transformation is done with a fixed-dimension DFT (the 
FFT algorithm is used). 

WI: The transformation into the frequency domain is performed with a 
variable-dimension DFT. 

o PPE: The alignment is incorporated into the pitch pulse extraction procedure. 
The extraction of the pulses is such that the pulses are aligned for maximum 
correlation. There is no “circular shift” of the waveforms. 

WI: Characteristic waveforms are aligned using periodic extension of the ex- 
tracted pitch pulses. The waveforms are extracted pitch asynchronously and 
then undergo a “circular shift” which brings them into alignment. As a result 
of the alignment, a pulse represented by a characteristic waveform might have 
its front after its tail. 

o PPE: The underlying pitch pulses and the pitch pulse noise are estimated via 
adaptive filtering or other methods which minimize a specified weighted error 
criterion. At present, we estimate underlying pulses in the time domain but the 
analysis in the frequency domain is also possible. 

WI: A slowly evolving waveform (SEW) and a rapidly evolving waveform (REW) 
are obtained by linear filtering of the characteristic waveforms. The linear filter 
has fixed coefficients. 

o PPE: There is a separate interpolation of the pitch pulse lengths and of the pitch 
pulse waveshapes. Time-warping of the residual is avoided and approximate 
synchrony with the original is maintained. 

WI: The spectral interpolation results in time warping of the residual and the 
time synchrony with the original is not maintained. In long voiced sections, the 
synthesized waveform may gain or lose pulses. 

Speech coders are usually divided into waveform coders and parametric coders. 
The difference has become blurred over the years with many coders resisting clear 
classification into one group or the other. In general, coders which reconstruct a 
signal which converges to the original with decreasing quantization error are called 
waveform coders. The coders whose reconstructed signal does not converge to the 
original are called parametric coders (Kleijn and Paliwal 1995). The LP analysis- 
by-synthesis coders are classified as waveform coders while sinusoidal coders and 
waveform interpolation coders are classified as parametric coders. The PPE coder 
can in fact be seen either as a waveform coder or as a parametric coder depending 
on the bit rate used in the encoding of the pitch information; more specifically it 
depends on how the pitch length is interpolated*. Provided the bit rate for the 
pitch information is high enough so that the pitch pulse lengths do not need to be 
interpolated (about 44 bits per 160 samples), the PPE model leads to a waveform 
coder. When the number of bits for the pitch information is lower, as in the case of 
the implemented coder (we use 8 bits per 160-sample frame), the model results in a 
parametric coder. 

*In most of the coders in which the pitch interpolation is used, the interpolation mechanism is 
incorporated into the coder structure in such a way that, even for a high bit rate coding, the original 
and the reconstructed signals do not converge. In particular they do not converge in the coders 
which use time warping (e.g., in WI and in sinusoidal coding). 

6.3 Our Contributions 

In this work we developed a new speech coding model which was designed for ap- 
propriate representation of the LP excitation, particularly during the quasi-periodic 
voiced segments. The model combines a number of different coding techniques which 
we described and analyzed throughout the thesis. The proposed method also offers 
new solutions with respect to the problems identified in other speech coding systems. 

We developed estimation techniques, which identify underlying pitch pulses in a 
noisy observation based on noise error minimization. A new pitch pulse extraction 
algorithm was implemented to satisfy the demanding requirements of the PPE model. 
Our pitch extraction method combines (i) pitch period estimation, (ii) pitch pulse 
extraction, and (iii) waveform alignment. 

We suggested and implemented coding of the pitch information based on the 
positions of the pitch pulses. With this encoding we maintain the low pitch-coding 
bit rate of the WI coder and the time synchrony of a sinusoidal coder. 

In the PPE coder the time and the frequency domain analysis are combined in a 
unique way. In particular, we proposed and implemented separate interpolations on 
the pitch pulse lengths (time domain) and the pitch pulse waveshapes (frequency do- 
main). We also incorporated into the PPE coder the generalized analysis-by-synthesis 
paradigm. The noise component of the LP excitation is coded in the time domain 
using analysis-by-synthesis with respect to the modified (through pitch pulse-length 
interpolation) speech signal. 

We implemented a 4 kb/s speech coder which produces very high quality coded 
speech. In informal listening tests the coder was judged to provide toll quality 
reconstructed speech when all the parameters were coded except for the 
underlying pitch pulse shapes. At present, quantizing the pitch pulses introduces 
slight distortion which we believe can be reduced by better training of the codebooks 
and by improving the weighted error criterion. 

6.4 Claims of Originality 

In this thesis, we have developed a speech coder based on several new concepts. The 
novel aspects include: 

1. Modelling the LP excitation as a series of underlying pitch (glottal) pulses 
obscured by noise. Both the noise and the underlying pulses are present in the 
excitation simultaneously (in varying degrees). 

2. An algorithm for robust identification of pitch pulse boundaries. Segmentation 
of the LP residual into noisy pitch pulses is based on an error criterion with 
respect to a set of model pulses. 

3. Estimating the underlying pitch pulses based on the error between the underly- 
ing pulses and the noisy pulses and the error between the consecutive underlying 
pitch pulses. This approach is better than simple filtering of the pulses as it 
takes into account the evolution of the underlying pulse shapes. 

4. Separate interpolation of the pitch pulse shapes and of the pitch pulse lengths. 
In our interpolation of the pulse shapes, time warping of the pulses is avoided. 
In the interpolation of the pitch pulse lengths, we determine the position of 
every pulse. 

5. The encoding of the pitch information and the reconstruction of the coded 
speech with relaxed time synchrony with respect to the original signal. 

6.5 Future Work 

We demonstrated that the proposed model is capable of achieving near toll quality
speech coding at rates around 4 kb/s. The implemented coder, however, was built
in an experimental setting. The PPE model provides a powerful framework in which
many analysis blocks that have so far been independent may be integrated, opening
the possibility of better overall performance. The potential of the PPE model has
not yet been fully exploited in the presented implementation.

The prediction of the underlying pitch pulse at the receiver is simply the last
encoded pitch pulse. The problem of reliable prediction was briefly investigated
but not addressed in this thesis. Predicting the current underlying pitch pulse
from the past pulses is worth pursuing.

The estimation of the underlying pitch pulses is executed in the time domain.
The estimation algorithm could, however, also be applied to the spectra of the
extracted pulses. Differences between estimation in the time domain and in the
frequency domain could be explored further.

More research should be done to enhance the coding of the underlying pitch pulses.
This might include training the pitch pulse codebooks on a large database and a
more careful design of the weighted error criterion used in selecting the codebook
entries. In our tests the coder performed excellently on some test sentences (from
both male and female speakers) and more poorly on others.

A large number of computational improvements are possible to reduce the present 
complexity of the coder. In particular, the pitch pulse extraction could be further 
simplified and the handling of the sub-sample resolution could be made more efficient. 

Combining the LP analysis with the extraction and estimation of the underlying 
pitch pulses is yet another area of possible research. Initial experiments conducted in 
this direction are very promising (Zad-Issa and Kabal 1997). 

The coder should also be examined with reference to background acoustic noise 
and bit sensitivity to transmission errors. Possible modifications to make the coder 
more robust could be investigated. 

Most coders use pre- and post-filtering to enhance the perceptual quality of
the reconstructed signal. Often the filters are designed for the specific
implemented method and use the coded parameters to guide the strength and type of
filtering required. Pre- and post-filters designed specifically for the PPE coder
are a possible extension of our work.

The PPE model was explained and implemented in the setting of pitch synchronous
analysis. Some of the ideas developed in this thesis can, however, be used in
conjunction with many other coding methods. In particular, the suggested pitch
pulse estimation techniques can be used in the context of WI to obtain the slowly
evolving waveforms, and in the context of CELP to obtain more perceptually relevant
adaptive codebooks.

A form of the PPE extraction algorithm could be used as a speech pre-processing
unit in a CELP coder in order to eliminate fractional-pitch coding. The pulse
positions would not be interpolated but only rounded to the nearest integer value
(to one-sample resolution).

We believe that the PPE paradigm of coding accurately reflects speech production 
with the parameterization appropriate for generating high quality speech at very low 
bit rates. We hope that work will continue to evolve the PPE technique into a robust, 
low-complexity/high-quality speech coder.

Appendix A

The Pitch Pulse Length Interpolation Algorithm


The pitch pulse length interpolation problem is formulated as follows: 

Given the last pitch pulse length, segment the block of T samples into 
N pitch pulses so that the sum of differences between consecutive pulse 
lengths is minimized. 
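
To make the constraint explicit, the segmentation can be restated as follows. This is a reading of the routine listed in this appendix, not a derivation taken from the thesis; the symbols $P_i$, $c_i$ and $s$ are introduced here for illustration only. Let $P_0$ be lastP, and let pulse $i$ differ from its predecessor by $c_i \ge 0$ samples in the direction $s = \mathrm{sign}(D)$:

$$P_i = P_{i-1} + s\,c_i, \qquad i = 1,\dots,N, \qquad D = T - N P_0 .$$

Requiring the $N$ lengths to fill the block of $T$ samples gives

$$\sum_{i=1}^{N} P_i = N P_0 + s \sum_{i=1}^{N} (N-i+1)\,c_i = T
\quad\Longrightarrow\quad
\sum_{i=1}^{N} (N-i+1)\,c_i = |D| .$$

Setting every $c_i$ to the same value $n$ uses $n\,N(N+1)/2$ samples. The routine takes $n = \lfloor |D| / (N(N+1)/2) \rfloor$ and spends the remainder $k$ by raising individual differences by one, where raising $c_i$ is worth $N-i+1$ samples (the code indexes from zero, so this weight appears there as N-i). For example, with $P_0 = 50$, $T = 163$ and $N = 3$: $D = 13$, $n = 2$, $k = 1$, only the last difference is raised, so $c = (2, 2, 3)$ and the lengths are 52, 54 and 57, which sum to 163.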



/* Input:
     lastP - the last pitch pulse length
     T     - number of samples between the pitch pulse position
             coded in the last frame and the pitch pulse position
             coded in the current frame
     N     - number of pulses coded in this frame (number of
             pulses in T samples)

   Output:
     PLen  - the lengths of the N pulses coded in this frame
*/

#include <math.h>     /* round */
#include <stdlib.h>   /* abs */

void
ppSegment (int lastP, int T, int N, int PLen[])
{
  /* MAX_NO_OF_PULSES is assumed to be defined elsewhere in the coder */
  int D, c[MAX_NO_OF_PULSES], i, k, n ;

  /* If the length of the last pitch pulse is not available, set its
     value to the average pitch pulse length of the current frame
  */
  if ( lastP == 0 )
    lastP = (int) round ((float) T/N) ;

  /* The number of samples which would have to be added (or taken
     away) from N pulses of length lastP
  */
  D = T - N*lastP ;

  if ( D == 0 ) {
    for ( i=0 ; i<N ; i++ )
      PLen[i] = lastP ;
  }
  else {
    /* Calculate the differences between the lengths of consecutive
       pitch pulses. The sum of these differences is minimized
    */
    i = (N+1)*N/2 ;
    n = abs(D)/i ;   /* integer division */
    k = abs(D)%i ;   /* remainder still to be distributed */
    for ( i=0 ; i<N ; i++ ) {
      c[i] = n ;
      if ( k >= N-i ) {   /* raising c[i] lengthens N-i pulses */
        c[i]++ ;
        k -= N-i ;
      }
    }

    /* Calculate the lengths of the coded pulses */
    if ( D > 0 ) {
      PLen[0] = lastP + c[0] ;
      for ( i=1 ; i<N ; i++ )
        PLen[i] = PLen[i-1] + c[i] ;
    }
    else {
      PLen[0] = lastP - c[0] ;
      for ( i=1 ; i<N ; i++ )
        PLen[i] = PLen[i-1] - c[i] ;
    }
  }
}
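
The routine can be exercised in isolation. The sketch below is a condensed ANSI C restatement of ppSegment for experimentation, not the thesis code itself: the names ppSegmentSketch and ppTotalLength and the value chosen for MAX_NO_OF_PULSES are assumptions made here. It folds the D == 0 case into the general path, where all differences simply become zero, and it rounds T/N with integer arithmetic, which assumes positive T and N.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_NO_OF_PULSES 32   /* assumed bound; the thesis does not state its value */

/* Condensed restatement of ppSegment(), for experimentation only. */
void ppSegmentSketch(int lastP, int T, int N, int PLen[])
{
    int c[MAX_NO_OF_PULSES];
    int D, i, k, n, step;

    assert(N > 0 && N <= MAX_NO_OF_PULSES);

    if (lastP == 0)
        lastP = (2 * T + N) / (2 * N);   /* T/N rounded to the nearest integer */

    D = T - N * lastP;                   /* samples to add (D > 0) or remove (D < 0) */
    step = (N + 1) * N / 2;              /* samples gained when every difference grows by 1 */
    n = abs(D) / step;
    k = abs(D) % step;

    for (i = 0; i < N; i++) {
        c[i] = n;
        if (k >= N - i) {                /* difference i changes pulse i and all later ones */
            c[i]++;
            k -= N - i;
        }
    }

    for (i = 0; i < N; i++) {
        int prev = (i == 0) ? lastP : PLen[i - 1];
        PLen[i] = (D >= 0) ? prev + c[i] : prev - c[i];
    }
}

/* Sum of the pulse lengths; equals T after a successful segmentation. */
int ppTotalLength(const int PLen[], int N)
{
    int i, sum = 0;

    for (i = 0; i < N; i++)
        sum += PLen[i];
    return sum;
}
```

With lastP = 50, T = 163 and N = 3 the sketch yields the lengths 52, 54 and 57, which sum to T. With lastP = 60 the block has to shrink and the lengths become 57, 54 and 52.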
Appendix B

Weighted Minimum Square Linear Fit 




In this description we use the following notation: M(i) is the i-th column vector of the
matrix M, M[i] is the i-th row vector of the matrix M, and M^T is the transpose of the
matrix (vector) M.

The problem is to find a line in D-dimensional space which will minimize the error E,

$$\underset{D\times N}{E} \;=\; \Bigl(\,\underset{D\times N}{X} \;-\; \underset{D\times 2}{Y}\;\underset{2\times N}{T}\,\Bigr)\,\underset{N\times N}{W}, \tag{B.1}$$

where X is a matrix of N given vectors of dimension D, T is a matrix of reference positions
of the given vectors, Y is a matrix of two vectors describing the D-dimensional line, and W
is a diagonal matrix of weights specifying the relative importance of the N given vectors.

The vector T[0] specifies the positions of the vectors of the matrix X normalized with
respect to the positions of the vectors of the matrix Y. The vector Y(0) corresponds to
position 0.0 and the vector Y(1) corresponds to position 1.0. The vector T[1] is given by

$$T[1] = \mathbf{1} - T[0], \tag{B.2}$$

where 1 is a vector of ones.
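
Read column by column, (B.1) models each given vector as a point on the line through Y(0) and Y(1). The expansion below is a sketch under the assumption that the quantity being minimized is the squared Frobenius norm of E, which the text does not state explicitly; the names $x_n$, $w_n$ and $J$ are introduced here. With $x_n = X(n)$ and $w_n$ the n-th diagonal entry of W,

$$E(n) = w_n\,\bigl(x_n - T[0]_n\,Y(0) - T[1]_n\,Y(1)\bigr),
\qquad
J = \|E\|_F^2 = \sum_{n=0}^{N-1} w_n^2\,\bigl\|x_n - T[0]_n\,Y(0) - T[1]_n\,Y(1)\bigr\|^2 .$$

Under this reading each vector enters the cost with weight $w_n^2$, so doubling an entry of W quadruples the influence of that vector on the fit.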

Equation (B.1) is rearranged as

$$Y\,T\,W = X\,W - E. \tag{B.3}$$

We write

$$\underset{2\times N}{T_w} = T\,W \tag{B.4}$$

and we multiply (B.3) by $T_w^T$:

$$Y\,T_w\,T_w^T = X\,W\,T_w^T - E\,T_w^T. \tag{B.5}$$

From the orthogonality principle we have

$$E\,T_w^T = 0. \tag{B.6}$$

With

$$\underset{N\times 2}{A} = W\,T_w^T\,\bigl(T_w\,T_w^T\bigr)^{-1} \tag{B.7}$$

the unknown Y is given by

$$\underset{D\times 2}{Y} = \underset{D\times N}{X}\;\underset{N\times 2}{A}. \tag{B.8}$$
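
Equations (B.4) to (B.8) translate into a few lines of code because $T_w T_w^T$ is only a 2 x 2 matrix. The following C sketch is an illustration, not the thesis implementation: the function names wlfComputeA and wlfFit, the row-major storage, and the exact-zero singularity test are assumptions made here. W is passed as the array of its diagonal entries, and T[1] is formed from T[0] as in (B.2).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Pre-compute A = W T_w^T (T_w T_w^T)^{-1}, equation (B.7).
 *   t0[n] : entry n of T[0]; T[1] is taken as 1 - t0[n], equation (B.2)
 *   w[n]  : diagonal entry n of W
 *   A     : output, N x 2, row-major (A[2*n + j])
 * Returns 0 on success, -1 if T_w T_w^T is singular.
 */
int wlfComputeA(const double t0[], const double w[], size_t N, double A[])
{
    double g00 = 0.0, g01 = 0.0, g11 = 0.0, det;
    size_t n;

    assert(N >= 2);

    /* G = T_w T_w^T is 2 x 2 and symmetric, with T_w = T W */
    for (n = 0; n < N; n++) {
        double a = t0[n] * w[n];
        double b = (1.0 - t0[n]) * w[n];
        g00 += a * a;
        g01 += a * b;
        g11 += b * b;
    }
    det = g00 * g11 - g01 * g01;
    if (det == 0.0)
        return -1;

    /* Row n of W T_w^T is w[n]^2 * (T[0]_n, T[1]_n); multiply it by G^{-1} */
    for (n = 0; n < N; n++) {
        double a = t0[n] * w[n] * w[n];
        double b = (1.0 - t0[n]) * w[n] * w[n];
        A[2 * n]     = (a * g11 - b * g01) / det;
        A[2 * n + 1] = (b * g00 - a * g01) / det;
    }
    return 0;
}

/* Y = X A, equation (B.8). X is D x N and Y is D x 2, both row-major. */
void wlfFit(const double X[], const double A[], size_t D, size_t N, double Y[])
{
    size_t d, n;

    for (d = 0; d < D; d++) {
        Y[2 * d] = 0.0;
        Y[2 * d + 1] = 0.0;
        for (n = 0; n < N; n++) {
            Y[2 * d]     += X[d * N + n] * A[2 * n];
            Y[2 * d + 1] += X[d * N + n] * A[2 * n + 1];
        }
    }
}
```

If the given vectors lie exactly on a line, the fit returns the endpoints of that line regardless of the weights. If all entries of T[0] are equal, the two rows of $T_w$ are proportional and the sketch reports the singular case.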


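If Y(0) is fixed then we write (B.1) as

$$\underset{D\times N}{E} \;=\; \Bigl(\,\underset{D\times N}{X} \;-\; \underset{D\times 1}{Y(0)}\;\underset{1\times N}{T[0]} \;-\; \underset{D\times 1}{Y(1)}\;\underset{1\times N}{T[1]}\,\Bigr)\,\underset{N\times N}{W}. \tag{B.9}$$

Now

$$Y(1)\,T[1]\,W = \bigl(X - Y(0)\,T[0]\bigr)\,W - E. \tag{B.10}$$

We write

$$\underset{1\times N}{t_w} = T[1]\,W \tag{B.11}$$

and multiply (B.10) by $t_w^T$:

$$Y(1)\,t_w\,t_w^T = \bigl(X - Y(0)\,T[0]\bigr)\,W\,t_w^T - E\,t_w^T. \tag{B.12}$$

From the orthogonality principle we have

$$E\,t_w^T = 0. \tag{B.13}$$

With

$$\underset{N\times 1}{a} = \frac{W\,t_w^T}{t_w\,t_w^T} \tag{B.14}$$

the unknown Y(1) is given by

$$\underset{D\times 1}{Y(1)} = \Bigl(\,\underset{D\times N}{X} \;-\; \underset{D\times 1}{Y(0)}\;\underset{1\times N}{T[0]}\,\Bigr)\,\underset{N\times 1}{a}. \tag{B.15}$$

The matrix A and the vector a can be pre-calculated for a given number of vectors N
and fixed weights W.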
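
The case with Y(0) fixed needs no matrix inversion, since $t_w t_w^T$ is a scalar. The C sketch below illustrates (B.11), (B.14) and (B.15) under the same assumptions as the appendix text; the function names wlfComputeWeights and wlfFitFixedStart and the row-major storage are choices made here for illustration. W is passed as its diagonal, and T[0] is recovered from T[1] through (B.2).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Pre-compute a = W t_w^T / (t_w t_w^T) with t_w = T[1] W, equations (B.11), (B.14).
 *   t1[n] : entry n of T[1]
 *   w[n]  : diagonal entry n of W
 *   a     : output vector of length N
 * Returns 0 on success, -1 if t_w is the zero vector.
 */
int wlfComputeWeights(const double t1[], const double w[], size_t N, double a[])
{
    double energy = 0.0;   /* the scalar t_w t_w^T */
    size_t n;

    assert(N >= 1);

    for (n = 0; n < N; n++) {
        double tw = t1[n] * w[n];
        energy += tw * tw;
    }
    if (energy == 0.0)
        return -1;

    /* Entry n of W t_w^T is w[n]^2 * T[1]_n */
    for (n = 0; n < N; n++)
        a[n] = w[n] * w[n] * t1[n] / energy;
    return 0;
}

/*
 * y1 = (X - y0 T[0]) a, equation (B.15), with T[0] = 1 - T[1] from (B.2).
 * X is D x N, row-major; y0 and y1 have length D.
 */
void wlfFitFixedStart(const double X[], const double y0[], const double t1[],
                      const double a[], size_t D, size_t N, double y1[])
{
    size_t d, n;

    for (d = 0; d < D; d++) {
        y1[d] = 0.0;
        for (n = 0; n < N; n++)
            y1[d] += (X[d * N + n] - y0[d] * (1.0 - t1[n])) * a[n];
    }
}
```

A vector with T[1] equal to zero coincides in the model with the fixed endpoint Y(0), so it receives a zero entry in a and has no effect on the estimate of Y(1).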